Generate a “graphical” Zipf distribution for the entire text of Moby Dick
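The program below reads moby_dick.html, a version of the text found on the web. The Regex.Replace call removes the HTML tags. The Split call then breaks the entire text into words, and the chain continues with a LINQ grouping expression that tallies each distinct word, ignoring case, into an anonymous type and sorts the tallies from most to least frequent. After deriving a scaling factor that gives the most frequent word a bar of roughly 60 asterisks, the foreach loop prints an ASCII histogram bar for each of the top 35 words.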
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class MainClass
{
    static void Main()
    {
        // Read the whole HTML file into memory.
        string text = File.ReadAllText("moby_dick.html");

        // Strip HTML tags, including tags that span several lines.
        text = Regex.Replace(text, "<(.|\n)*?>", String.Empty);

        // Split into words, then tally each distinct word, ignoring case.
        var tallies = text
            .Split(" \n\",.;-!?".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w.ToLower())
            .Select(g => new { g.Key, Tally = g.Count() })
            .OrderByDescending(e => e.Tally)
            .ToList(); // materialize once so the query is not re-run below

        // Scale so that the most frequent word gets roughly 60 asterisks.
        int scale = tallies.First().Tally / 60;

        foreach (var tally in tallies.Take(35))
            Console.WriteLine("{0,6} {1}", tally.Key, new String('*', tally.Tally / scale));
    }
}
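The same pipeline can be tried on a short in-memory string, which makes the tallying step easy to inspect without the full novel. The sketch below is not part of the original listing: the ZipfSketch class and its Tally and Scale helpers are hypothetical names introduced for illustration. The Math.Max guard is an added assumption. It keeps the integer division from yielding a scale of zero, and with it a DivideByZeroException, when the top tally is below the bar width.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ZipfSketch
{
    // Hypothetical helper: tally distinct words (ignoring case), most frequent first.
    public static List<KeyValuePair<string, int>> Tally(string text)
    {
        return text
            .Split(" \n\",.;-!?".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w.ToLower())
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(e => e.Value)
            .ToList();
    }

    // Hypothetical helper: integer scale so the top word gets about `width` stars.
    // Math.Max keeps the scale at 1 or more, so short texts do not divide by zero.
    public static int Scale(int topTally, int width)
    {
        return Math.Max(1, topTally / width);
    }

    static void Main()
    {
        var tallies = Tally("The whale, the sea and the ship. The sea!");
        int scale = Scale(tallies[0].Value, 60);
        foreach (var t in tallies)
            Console.WriteLine("{0,6} {1}", t.Key, new String('*', t.Value / scale));
    }
}
```

Returning a list of key/value pairs instead of an anonymous type is only a convenience here, since anonymous types cannot appear in a method signature.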
Output.
   the ************************************************************
    of ***************************
   and **************************
    to *******************
     a *******************
    in *****************
  that ************
   his **********
    it **********
     i ********
   but *******
    he *******
    as *******
  with *******
    is *******
   was ******
   for ******
   all ******
  this *****
    at *****
    by ****
   not ****
  from ****
   him ****
    so ****
    on ****
 whale ****
    be ****
   one ***
   you ***
 there ***
   now ***
   had ***
  have ***
    or **
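
Zipf's law predicts that a word's frequency is roughly inversely proportional to its rank, so rank times tally should stay in the same ballpark down the list. One rough way to check this is to compute that product for each entry. The sketch below is an illustration only: ZipfCheck and RankTimesTally are hypothetical names, and the input is an idealized sequence (600 divided by the rank), not counts measured from Moby Dick.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ZipfCheck
{
    // Hypothetical helper: multiply each tally by its 1-based rank.
    // Under an ideal Zipf distribution every product equals the top tally.
    public static List<long> RankTimesTally(IEnumerable<int> talliesDescending)
    {
        return talliesDescending
            .Select((tally, index) => (long)tally * (index + 1))
            .ToList();
    }

    static void Main()
    {
        // Idealized input, not measured data: tally = 600 / rank.
        var ideal = new[] { 600, 300, 200, 150, 120 };
        foreach (long product in RankTimesTally(ideal))
            Console.WriteLine(product); // prints 600 five times
    }
}
```

Fed with real tallies, for example tallies.Select(t => t.Tally) from the listing, the products can be expected to drift rather than stay constant, since natural text only approximates the law.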
