Wednesday, January 03, 2007

A Tale of Two Dubyas

Recently, after listening to our current President speak, I decided to compare presidential speeches. I first started to look at text-based network methods to compare theme and usage biases, but then I read about readability scores. Although speeches are not meant to be read by anyone other than the speaker, I decided that a readability analysis should also work for speeches. There are many ways to test the readability of a text. Most readability tests express the result as the grade level that a reader should have completed to understand the given text. I used five different tests of readability: the Gunning fog index, the Coleman-Liau index, the Flesch-Kincaid grade level, the Automated Readability Index, and SMOG. All of these indices are very similar and gave similar results.

The above graph shows the results of all five tests, averaged for each president. After I analyzed the State of the Union speeches, I was curious to compare a formal speech with a less formal one, and I chose the election debates as the less formal speech. My thinking was that the State of the Union address is written by a team of professional writers while a debate is not. This is not to say that a president is not prepared by his staff before a debate. Of course he is. But during a debate a president will occasionally have to answer on his own. My assumption was that the grade level would be higher for the more formal State of the Union address. In all cases this was true. If we use the difference between the two types of speech as a measure of how much help each president receives for formal speeches, we get to my next assumption: that George W. Bush would show the largest difference, as the president needing the most help. This seems to be the case. Bush's speech level moves up 3.3 grade levels when he goes from a less formal speech to a formal speech. GWB and his father are also the only presidents who have a debate grade level at the middle school level.

Note: Between the Kennedy-Nixon debates and the Carter-Ford debates, no election debates were held, which is why I did not include LBJ in the analysis. I obtained the text of all of the speeches and debates from The American Presidency Project, and I used an online text analyzer to obtain the readability scores.
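For readers who want to see what such scores involve, here is a minimal Python sketch of two of the five indices, the Automated Readability Index and the Coleman-Liau index, which need only letter, word, and sentence counts. It is not the online analyzer used for this post; the tokenizing rules (regex word matching, counting runs of `.`, `!`, `?` as sentence ends) and the function names are assumptions made for illustration, so its numbers will not match any particular tool exactly.

```python
import re


def text_stats(text):
    """Count letters, words, and sentences using simple heuristics (assumed rules)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        raise ValueError("text contains no words")
    # Count only letters and digits, not apostrophes.
    letters = sum(ch.isalnum() for word in words for ch in word)
    # Each run of ., ! or ? is treated as one sentence boundary.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return letters, len(words), sentences


def automated_readability_index(text):
    """ARI = 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43."""
    letters, words, sentences = text_stats(text)
    return 4.71 * letters / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8, with L and S counted per 100 words."""
    letters, words, sentences = text_stats(text)
    letters_per_100 = 100.0 * letters / words
    sentences_per_100 = 100.0 * sentences / words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def average_grade(text):
    """Average the two indices, mirroring the idea of averaging several scores."""
    scores = [automated_readability_index(text), coleman_liau_index(text)]
    return sum(scores) / len(scores)
```

Running `average_grade` on a State of the Union transcript and on a debate transcript for the same president, then subtracting, would give a figure comparable in spirit to the grade-level differences discussed above. Very short or very simple texts can produce grade levels below zero.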

Monday, November 13, 2006


I have often wondered how parents choose names for their children, and why some names become trendy. Since most Americans do not follow the traditional Anglo-American naming patterns that were common during the colonial era, how do they choose the name of their baby? Also, do parents look for different attributes in a girl's name compared to a boy's name?

I have plotted the top 100 names for each year from 1880 to 2005 for both boys and girls. This led to a list of 438 boys' names and 330 girls' names. I plotted the names based on their peak occurrence year (x-axis) versus the yearly average popularity (y-axis). Most of the names have a single peak; some have multiple peaks, in which case I use the highest peak year. Some names have a sharp peak and are very popular for only five years, while others have a very dull peak but are popular for decades. Originally, I segregated the names based on the shape of their curves (sharp peaks, dull peaks, multiple peaks, etc.), but I finally decided to show this one plot. The first thing I noticed from this plot is that the names bunch up at the ends. Of course, this is because we don't have data for the years before 1880 or for the future. Not having that data makes a number of names appear to peak at the two temporal ends. If we ignore the first and last 20 years to "control" for this effect, we can see that from 1900 to 1985 the number of names increases for girls. There is no such increase for boys' names. So why are girls' names becoming increasingly diverse?
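As a rough sketch of the two plotted quantities, the Python below computes a peak year and a yearly average popularity for each name, then counts how many names peak in each decade within a chosen window. The data layout (a dict of year to popularity per name), the function names, and the choice to average only over the years present in the data are all assumptions for illustration; the post does not say exactly how the average was taken.

```python
from collections import Counter


def peak_and_average(popularity_by_year):
    """Return (peak_year, average_popularity) for one name.

    popularity_by_year maps year -> popularity for the years the name
    appears in the data (assumed layout). With multiple peaks, the year
    of the highest value is used.
    """
    peak_year = max(popularity_by_year, key=popularity_by_year.get)
    average = sum(popularity_by_year.values()) / len(popularity_by_year)
    return peak_year, average


def summarize(names):
    """Map each name to its (peak_year, average_popularity) pair."""
    return {name: peak_and_average(series) for name, series in names.items()}


def peaks_per_decade(summary, first_year, last_year):
    """Count names whose peak falls in each decade between the two years.

    Restricting the window drops names that pile up at the ends of the data.
    """
    counts = Counter()
    for peak_year, _average in summary.values():
        if first_year <= peak_year <= last_year:
            counts[peak_year // 10 * 10] += 1
    return dict(sorted(counts.items()))
```

Calling `peaks_per_decade(summary, 1900, 1985)` separately for girls' and boys' names would give the per-decade counts behind the comparison above.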

I didn't show individual name profiles because there is already a site that does it so well. Please visit the Baby Name Wizard and have fun. If anyone would like a higher resolution image of the above, I will be happy to send it as a PDF.

Monday, July 24, 2006

Movie director clusters

Using the Sight and Sound poll of movie directors, I made a network diagram based on common favorite movies. Each director provided his or her top ten list of best/favorite movies. The more movies that two directors had in common, the closer the tie. From this analysis I partitioned the network into four clusters, shown by the coloring of the nodes/vertices.
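A minimal sketch of how such ties can be computed: for every pair of directors, count the movies their top-ten lists share and keep the pairs with at least one in common. The dict layout and the function name are assumptions for illustration, and the clustering step itself is not shown.

```python
from itertools import combinations


def shared_movie_ties(top_tens):
    """Return {(director_a, director_b): number_of_shared_movies}.

    top_tens maps a director's name to a list of movie titles (assumed layout).
    Pairs with no movie in common are left out, so they have no tie.
    """
    ties = {}
    for a, b in combinations(sorted(top_tens), 2):
        shared = len(set(top_tens[a]) & set(top_tens[b]))
        if shared > 0:
            ties[(a, b)] = shared
    return ties
```

The resulting counts can serve as edge weights, with a higher count meaning a closer tie in the layout.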

Interestingly, when I compared the four groups I found consistent differences that track each group's distance from the center. Group 1 (the center group) averaged 64 years old, and 82% were from English-speaking countries, while group 4 (the outer group) averaged 55 years old, and 45% were from English-speaking countries. Groups 2 and 3 had numbers between the two extremes.

When I compared which movies the groups picked I found that the directors from groups #1 and #2 liked Citizen Kane, 8 1/2, and The Godfather. The group #3 and #4 upstarts liked The Apartment, Tokyo Story, and The Mirror.

Saturday, April 29, 2006

Senate Voting Patterns, part 3

This is the final installment of the Senate voting network analysis. I divided the last 64 months into eight equal periods and analyzed the Senate voting patterns for each period. For an explanation of how I made the networks, please see my last post. On a few of the graphs I have pointed out some of the senators. If there are specific senators that people are interested in, I can identify them. Each individual network is shown below.

Above is a summary showing all eight networks (A) with additional plots. Part B of the above figure shows eight scatterplots as a view of how often senators vote within and outside their party (x-axis "% with Democrats", y-axis "% with Republicans"). The cross identifies the 50% mark. Part C of the above figure is an additional view of the same data. I made eight histograms of the difference between each senator's votes within and outside the party. In this case, the right humps are made up of Republicans and the left humps of Democrats. The final part (D) of the figure shows how the president's approval rating (based on many polls) has changed over the time period.

Although my original purpose for this blog was simply to show some interesting visuals of data, I do feel the need to point out some "findings". In the graphic above one can see the ebb and flow of polarization in the Senate. The degree of polarization (best seen in B and C) changes with important events such as September 11th (less polarized), the invasion of Iraq (more polarized), and the 2004 election (slightly less polarized). I find the period that included the invasion of Iraq the most interesting. Although the Senate is more polarized, based on a comparison of voting within and outside of a senator's party (again, parts B and C above), the network does not show this. Instead, the network seems to lose much of its structure.

The other interesting thing that I have noticed from looking at these networks is that senators from New Hampshire and Arizona are consistently the most conservative. I am scoring conservative and liberal based on placement within the network (for a more thorough and quantitatively rigorous approach, please see Voteview). Although McCain is less consistent, he does tend to vote similarly to his "Grand Canyon State" confederate, Kyl. Of the two, Arizona and New Hampshire, I find New Hampshire more interesting because of the disproportionate power that the New Hampshire voters possess. I'll leave it to others to debate how the outdated primary system needs to be changed.

Monday, April 17, 2006

Senate Voting Patterns, part 2

Since my last post was well received, I have decided to stretch the analysis of Senate voting networks into additional posts. Also, I have included a graphic to help explain what these networks represent. All of the data presented in this post is from the U.S. Senate roll call votes for 2001.

The first figure, I hope, explains what these networks represent. In part (A) there is a small table that is part of a large table of vote data. For 2001 there were 380 votes; here you can see ten votes for just three senators. I take all of the votes and compute a matrix similar to the one in part (B). The matrix summarizes the percent of votes that senators have in common. I use the percent numbers to make networks. Each link in a network has a weight equal to the corresponding percent from the matrix. In Pajek, the network analysis program, I set thresholds to remove the weak links. Part (C) shows a network before and after the threshold.
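The steps above (vote table, percent-agreement matrix, thresholded links) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual workflow, which used Pajek: votes are assumed to be coded as "Y", "N", or `None` for a missed vote, missed votes are simply skipped when comparing two senators, and the function names are hypothetical.

```python
from itertools import combinations


def percent_agreement(votes_a, votes_b):
    """Percent of roll calls on which two senators cast the same vote.

    Only roll calls where both senators voted are counted (an assumed way
    of handling missed votes). Returns None if they never both voted.
    """
    both = [(a, b) for a, b in zip(votes_a, votes_b)
            if a is not None and b is not None]
    if not both:
        return None
    same = sum(a == b for a, b in both)
    return 100.0 * same / len(both)


def agreement_edges(votes, threshold):
    """Return weighted links {(senator, senator): percent} above the threshold.

    votes maps a senator's name to a list of votes, one per roll call,
    in the same order for every senator.
    """
    edges = {}
    for s, t in combinations(sorted(votes), 2):
        pct = percent_agreement(votes[s], votes[t])
        if pct is not None and pct > threshold:
            edges[(s, t)] = pct
    return edges
```

Raising `threshold` removes more of the weak links, which is the before-and-after effect that part (C) of the figure illustrates.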

In the networks that follow I have split 2001 into pre- and post-9/11. Also, I have split each network into clusters. The clusters are based on partitioning that I performed using Pajek. To show the clusters I have simply boxed and numbered them. Following each network visualization I include a table showing cluster membership. The biggest difference between the pre- and post-9/11 voting patterns is that after 9/11 the Senate was less polarized. This was not a surprising finding; it will happen when there are 100-0 votes. But not all of the votes were unanimous, and the changes were not symmetrical. The far right did not change very much, while the far left suddenly started to vote like conservative Democrats.

Another way to look at the voting changes after 9/11 is to plot how often senators voted within their party vs. outside their party. Again, the Democrats changed more than the Republicans. This can be seen best by comparing the boxes that I included to highlight the range or spread within each party. The Republicans retain a degree of heterogeneity, while the Democrats do not. You may remember that my previous post had a similar plot for 2005. From that, it seems that the Democrats have rebounded, displaying a degree of voting heterogeneity similar to what they had in the first eight months of 2001. Currently, I am analyzing data for the period between 2001 and 2005 to see when the rebound took place.

Thursday, April 06, 2006

Senate voting patterns

I have started analyzing the voting patterns in the Senate. Although this post involves political analysis, future posts may involve analysis of demographics, pop culture, sports, education, buying patterns, the stock market or anything else that I happen to be looking into at the time. I made a matrix that shows how often senators vote the same. For instance, Kerry and Kennedy voted the same direction 94% of the time, while Kerry and Frist cast the same vote only 34% of the time. All of these comparisons are corrected for the many times that senators do not vote. From the 100 x 100 matrix I was able to make graphs/networks of voting patterns by using various thresholds. The network below was made by using a threshold of >70%. I have highlighted a few senators. The networks were made with Pajek, a free social network analysis (SNA) program.
If we focus on the two major parties (treating Jeffords, an Independent, as a Democrat) we are able to see some internal structure.

By increasing the threshold we lose information but gain some insight into the core of each party. The core of the Republican Party is made up of what some might call "middle of the road" Republicans, while the core of the Democratic Party is made up of the most liberal cluster of Democrats.

Another way to look at the data is to plot how each senator voted on average compared to all of the Democrats and all of the Republicans. The scatter plot below shows just that. I have highlighted McCain and Feingold as well as the "bridge senators". This view helps show how McCain's voting deviates from that of the other Republicans. Some think of McCain as a moderate Republican, but he is anything but moderate. In every analysis he clusters with Sununu, Gregg, and Kyl, but does not vote similarly to Chafee, Snowe, Collins, or Specter, the true moderate Republicans.
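One way to get the two coordinates for such a scatter plot is to average each senator's percent agreement with every Democrat and with every Republican. The sketch below assumes a nested dict of pairwise percentages and a dict of party labels ("D" or "R", with an independent relabeled as "D" beforehand if desired); both layouts and the function name are illustrative assumptions.

```python
def party_coordinates(agreement, party):
    """Return {senator: (mean % with Democrats, mean % with Republicans)}.

    agreement[a][b] is the percent of votes senators a and b have in common.
    party[a] is "D" or "R". A senator is never compared with himself or
    herself. A coordinate is None if there is no one to compare with.
    """
    coords = {}
    for senator, row in agreement.items():
        by_party = {"D": [], "R": []}
        for other, pct in row.items():
            if other != senator and party.get(other) in by_party:
                by_party[party[other]].append(pct)
        coords[senator] = tuple(
            sum(by_party[p]) / len(by_party[p]) if by_party[p] else None
            for p in ("D", "R")
        )
    return coords
```

A senator whose two coordinates are close together sits between the parties on the plot, while one with a high value on one axis and a low value on the other votes almost entirely with a single party.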