Content Analysis

Sociologists don't always study people. When examining social structure, we often study groups, such as families, organizations, communities or nations. But we can also study things. Human-made objects, called SOCIAL ARTIFACTS, contain information about the society from which they come. Social artifacts can be things such as books, newspapers, advertisements, films, photographs, paintings, machines, buildings, and so forth — anything built by humans.

It's clear that we need a particular method for studying these things. Surveys, experiments and field research are designed to study people, not objects. Aggregate data describes groups. But what we need is a way to extract information from social objects.

CONTENT ANALYSIS is a general method for studying social artifacts. Content analysis usually quantifies the information in social artifacts by means of a technique called CODING. When coding an object, researchers make judgments about it according to a set of agreed-upon dimensions. Coding usually requires more than one judge, so that the resulting data reflect a more objective view of the artifact.

Before judging the objects, researchers must first agree on a common set of dimensions, and specific definitions of the kinds of evidence to be sought.

The content to be coded for depends on what the researcher is interested in studying. Almost anything can be coded for, provided that a set of categories is developed and a set of explicit rules for assigning codes is employed.

In my recent research, I studied sermons, speeches and written texts of the Black Abolitionists in antebellum New York State. I am interested in the moral rhetoric used to argue against slavery. There are many different arguments one could make to assert that slavery is evil, and I am trying to determine which arguments the Black Abolitionists used to convince their mostly white audiences in the state that the system of Southern slave labor ought to be abolished.

I am continuing to use the same methods to analyze contemporary contentious discourses. I have just completed a study of newspaper coverage of the creationism conflict and am currently working on a study of coverage of the gay marriage issue. (If you would like to learn more about my research, take a look at one of my recent research papers: Radicalization of Religious Discourse in El Salvador; The Decline of the Public Sphere: A Semiotic Analysis of the Rhetoric of Race in New York City; The Rhetoric of Black Abolitionism; or Conflict over Origins: A Discourse Analysis of the Creationism Dispute [PDF].)

In one phase of the abolitionism research, I coded for sixty themes in the texts, including, for example, slavery, liberty, equality, suffering, justice and labor. I wanted to know which themes are talked about most, and which least. To determine this, I counted the frequency of each theme, expressed as the number of paragraphs in which the theme occurs at least once. I also wanted to know which themes are talked about with which other themes. To determine this, I counted the co-occurrence of themes, expressed as the proportion of paragraphs in which two themes occur at least once.

Since content coding is generally labor-intensive, I have developed a set of computer programs to code these texts automatically. Instead of taking hundreds of hours to code these texts by hand, I can do it in a few minutes by computer.
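
To give a sense of how this kind of automated coding can work, here is a minimal sketch in Python. It is only an illustration, not the actual programs used in this research: the three themes, their keyword lists, and the rule of splitting paragraphs at blank lines are assumptions made for the example.

    from itertools import combinations

    # Illustrative coding dictionary: each theme is defined by a short list
    # of keywords that signal its presence (a real scheme would be far richer).
    THEMES = {
        "slavery": ["slave", "slavery", "bondage"],
        "liberty": ["liberty", "freedom", "free"],
        "justice": ["justice", "injustice", "unjust"],
    }

    def code_text(text):
        """Count theme frequency and co-occurrence across paragraphs."""
        # Assume paragraphs are separated by blank lines.
        paragraphs = [p.lower() for p in text.split("\n\n") if p.strip()]
        # For each paragraph, record which themes appear at least once.
        hits = [
            {t for t, words in THEMES.items() if any(w in p for w in words)}
            for p in paragraphs
        ]
        # Frequency: number of paragraphs in which a theme occurs at least once.
        frequency = {t: sum(t in h for h in hits) for t in THEMES}
        # Co-occurrence: proportion of paragraphs in which both themes occur.
        co_occurrence = {
            (a, b): sum(a in h and b in h for h in hits) / (len(paragraphs) or 1)
            for a, b in combinations(THEMES, 2)
        }
        return frequency, co_occurrence

A real scheme would need validation against hand coding, but the two measures are computed exactly as described above: a paragraph count per theme, and a paragraph proportion per pair of themes.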

For example, take a look at a speech by Frederick Douglass from 1852. See if you can identify the main themes of the speech.

It would take perhaps half an hour to code this text for frequency and co-occurrence by hand. I coded it in seconds by computer. Here are the results. The coded data can be used with a variety of qualitative and quantitative procedures.

Researchers coding texts sometimes look at grammatical features, such as transitivity or declension. Researchers interested in ideology, for example, might code for active and passive voice. Speakers, especially in political texts like campaign speeches, sometimes use passive voice to disguise causal arguments.
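
As a rough illustration, a pattern like the one in the sketch below can flag likely passive constructions. This is only a heuristic, with an intentionally short and incomplete list of irregular participles; a serious study would rely on a syntactic parser rather than a regular expression.

    import re

    # Heuristic: a form of "to be" followed by a likely past participle.
    # It will miss many passives and produce some false positives.
    PASSIVE = re.compile(
        r"\b(am|is|are|was|were|be|been|being)\s+"
        r"(\w+ed|made|given|taken|known|done|told|written)\b",
        re.IGNORECASE,
    )

    def passive_share(sentences):
        """Proportion of sentences flagged as containing a passive construction."""
        if not sentences:
            return 0.0
        return sum(bool(PASSIVE.search(s)) for s in sentences) / len(sentences)

    print(passive_share(["Mistakes were made.", "We made mistakes."]))  # 0.5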

When developing the analytical dimensions, researchers must choose to code for manifest or latent content. MANIFEST CONTENT is explicit in an object. If the artifacts involve text, the manifest content consists of the actual words and their denotations. If the artifacts are images, it refers to their observable features, such as shapes, colors or styles.

LATENT CONTENT is implicit. It is the content that is often implied, but not present, in text or images. The researcher has to interpret the presence of latent content. It is somewhat more difficult to code for, because it requires a familiarity with the context of the objects. One may miss the connotations of words if one is not familiar with their particular idiom. Images may also contain symbolic references. Again, the researcher has to be familiar with the context to be able to perceive these references. Latent content can be coded for objectively, if the coders all rely on the same set of interpretive guidelines.

Sampling

Just like with survey research or field methods, content analysis requires some kind of sampling. Random sampling is the best way to get a sample that represents the population of the social artifacts to be studied. Even with a random sample, there are problems that can arise when studying artifacts. A researcher usually wants to make generalizations to all artifacts of a particular type, such as American television commercials or science fiction novels. Not all objects of a particular type are available for study. This can result in sampling bias.
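
As a minimal illustration, drawing a simple random sample from a catalog of archived artifacts can be done as in the sketch below; the catalog of filenames is entirely hypothetical.

    import random

    # Hypothetical catalog of archived speeches, one file per speech.
    catalog = [f"speech_{year}_{n}.txt" for year in range(1830, 1861) for n in range(5)]

    random.seed(42)                      # fix the seed so the draw is reproducible
    sample = random.sample(catalog, 30)  # simple random sample, without replacement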

The main sampling-bias problem confronting researchers of artifacts has to do with availability. DEPOSIT BIAS (or SURVIVAL BIAS) occurs when certain categories of objects are more likely than others to be available for sampling. Not all of the artifacts of a particular type, such as speeches or photographs, are collected, or deposited, in archives. Not all of the artifacts produced survive long enough to be collected. Most research takes place some time after the artifacts are produced, so if they are not archived, they are typically not available for study. Thus, if certain kinds of artifacts are omitted, the sample can only be generalized to the population of archived artifacts. This kind of bias most often results in the omission of amateur objects, since these are least likely to find their way into archives.

Reliability & Validity

The reliability of coding is checked by comparing the ratings of the judges; this is called INTERCODER RELIABILITY. To the extent that the judges are in agreement, the data can be seen as reliable. Intercoder reliability is usually calculated in the form of a correlation coefficient, such as alpha, kappa, or Pearson's r. (Most correlation coefficients vary between 0 and 1, or -1 and 1, and measure the strength of the relationship between two variables--in this case, the judges' ratings. The closer the coefficient is to 1, the stronger the relationship. The closer the coefficient is to zero, the weaker it is. In most cases, researchers are not satisfied with intercoder reliabilities of less than 0.75.)
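
As an illustration, the sketch below computes simple percent agreement and Cohen's kappa (a coefficient that corrects for the agreement two judges would reach by chance) for two judges coding the same paragraphs. The ratings are invented for the example.

    from collections import Counter

    def percent_agreement(judge_a, judge_b):
        """Share of items on which the two judges assign the same code."""
        return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)

    def cohens_kappa(judge_a, judge_b):
        """Agreement corrected for what the judges would agree on by chance."""
        n = len(judge_a)
        observed = percent_agreement(judge_a, judge_b)
        counts_a, counts_b = Counter(judge_a), Counter(judge_b)
        expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Two judges coding the same ten paragraphs for the presence of a theme.
    judge_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
    judge_2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
    print(percent_agreement(judge_1, judge_2))  # 0.8
    print(cohens_kappa(judge_1, judge_2))       # about 0.58

Which coefficient is appropriate depends on the level of measurement of the codes; for simple presence/absence codes like these, kappa is a common choice.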

The validity of coding hinges on the appropriateness of the coding scheme. The codes must have FACE VALIDITY. They must seem, on the surface, to reflect the content of the objects being studied. The codes must also capture the relevant RANGE OF VARIATION in the context of the objects. If a researcher is studying the moral logic of sermons by charismatic ministers, for example, the codes must reflect the kinds of topics that these sermons might contain. If a researcher applies categories from mainline Protestant theology, the result will likely be a distorted view of the sermons. Finally, the codes must be appropriate to the time frame in which the objects originated. Researchers often study artifacts produced in the past. The researcher must be careful to generate codes that are relevant to the context of the objects and not to the researcher's own.