Is It Data?
22 Jan 2025 Posted in: digital humanities pedagogy
Snow is on the ground, but I bundled up to make it to campus for week two of “Data for the Rest of Us,” a semester-long, two-credit introduction to data literacy from a humanities perspective. The general arc of the course takes students through all parts of the data construction pipeline and culminates in small groups developing datasets based around their own interests to share back with the class. This week’s topic was “Data Identification,” which I structured in two segments: theory and practice.
First, we developed a working definition of what qualifies as data. Core to this was the Wikipedia page for data, which offers a neat and actionable summary in the first paragraph: “Data (/ˈdeɪtə/ DAY-tə, US also /ˈdætə/ DAT-ə) are a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted formally.”
For our purposes, I simplified this definition as containing three key elements. Data is…
- A collection
- of units of meaning
- that may be interpreted formally.
We discussed each of these pieces separately:
- What counts as a collection? Who is doing the collecting? What does it mean to arrange things together? What structures facilitate collection or collecting? When does data become a dataset?
- What counts as a unit of meaning? What is meaning in the humanities? What are the humanities anyway? What kinds of meaning units have we used in the past week?
- What do we mean by formal interpretation? In this case, I specifically referred to the kinds of interpretation made possible by computers. We talked about what computers are good at, what they are bad at, and the kinds of compromises we might have to make to move between the two. We also talked about the kinds of formal interpretations that are possible: mapping, pattern recognition, averages, counting, and more.
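To make the three-part definition concrete, here is a minimal sketch in Python. The book records, the field names, and the helper functions `count_by` and `average_of` are all hypothetical, invented for illustration rather than taken from the class. The list is the collection, each dictionary is a unit of meaning, and counting and averaging are two of the formal interpretations mentioned above.

```python
from collections import Counter
from statistics import mean

# Hypothetical collection: each dict is one unit of meaning (a book record).
# Titles, genres, and page counts are made up for this example.
books = [
    {"title": "Book A", "genre": "novel", "pages": 300},
    {"title": "Book B", "genre": "poetry", "pages": 90},
    {"title": "Book C", "genre": "novel", "pages": 420},
]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record[field] for record in records)


def average_of(records, field):
    """Average a numeric field across the whole collection."""
    return mean(record[field] for record in records)


genre_counts = count_by(books, "genre")
average_pages = average_of(books, "pages")

print(genre_counts)
print(average_pages)
```

Note the compromise involved: a computer can only count genres once someone has decided what the genre labels are and typed them consistently.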
In the second segment of the class, we put this definition into practice. I presented a series of objects to the students and asked them to apply our definition to each one. Did this count as data on our terms? That is…was it a collection of meaningful information that could be interpreted quantitatively? Why or why not? What units of information were there that could become data? How would we structure things if we wanted to convert this into data as defined above, such that we could work with it computationally?
We looked at:
- A spreadsheet with stuff in it
- An empty spreadsheet
- An Apple Watch loaded with personal metrics
- A diagram of cell phone metadata describing call counts, shortest paths between phones, and data structures
- A bookcase with books in it
- The Bible opened to a particular page
- A movie poster
- A set of movie posters
- A Wikipedia page about a Civil War battle
- A spreadsheet full of information about twentieth-century wars
- The front page of Vogue
Some conversations were less lively than others, but a few objects brought out great observations. With the Bible, for example, students noted that structural elements like chapter number, verse number, and page number might be units of meaning worth preserving, and they got there by thinking about how you cite material from it. I also talked a bit about the OCR process to orient them towards the path by which a physical object might become a collection of words in a plain text file.
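The students' point about citation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the citation shape ("Book chapter:verse") and the function name `parse_citation` are my own choices here, not anything we built in class, and real citations come in many more forms (verse ranges, abbreviations, and so on).

```python
import re

# Assumed citation shape: "<book> <chapter>:<verse>", e.g. "Genesis 1:1".
# Verse ranges and abbreviated book names are deliberately not handled.
CITATION = re.compile(r"^(?P<book>.+?)\s+(?P<chapter>\d+):(?P<verse>\d+)$")


def parse_citation(citation):
    """Split a citation string into book, chapter, and verse units."""
    match = CITATION.match(citation.strip())
    if match is None:
        raise ValueError(f"Unrecognized citation format: {citation!r}")
    return {
        "book": match["book"],
        "chapter": int(match["chapter"]),
        "verse": int(match["verse"]),
    }


record = parse_citation("Genesis 1:1")
print(record)
```

Once a citation is split this way, chapter and verse become fields you can sort, count, or filter on, which is exactly the kind of structure the students noticed was worth keeping.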
The other “this is not a pipe” moment came when I asked the students to talk about what was contained in a movie poster. After they named all the textual elements of the image, I pulled up the CSS Color Picker tool and talked about how images are also specific organizations of color data. So in addition to the textual information we understood as people, the computer was understanding things on a very visual level and in very quantified terms. This made a nice link to the Robots Reading Vogue project and brought out a discussion of how we can make meaningful interpretations from visual information.
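As a rough sketch of what "organizations of color data" means, the Python below treats a tiny made-up image as a grid of red, green, and blue values and counts how often each color appears. The six-pixel "poster" and the helper names `to_hex` and `color_counts` are hypothetical; a real poster would be loaded from a file with an imaging library and would have far more pixels.

```python
from collections import Counter

# Hypothetical 2x3 "poster": rows of (red, green, blue) pixels,
# each channel an integer from 0 to 255.
poster = [
    [(255, 0, 0), (255, 0, 0), (0, 0, 0)],
    [(255, 0, 0), (255, 255, 255), (0, 0, 0)],
]


def to_hex(pixel):
    """Format an (r, g, b) pixel as a CSS-style hex code like #ff0000."""
    r, g, b = pixel
    return f"#{r:02x}{g:02x}{b:02x}"


def color_counts(image):
    """Count how many pixels of each color the image contains."""
    return Counter(to_hex(pixel) for row in image for pixel in row)


counts = color_counts(poster)
dominant_color, dominant_count = counts.most_common(1)[0]

print(counts)
print(dominant_color, dominant_count)
```

The hex codes are the same kind of value a color picker reports, so counting them is one simple way a poster becomes a collection that can be interpreted formally.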
We closed with a short discussion of “On Missing Data Sets” by way of encouraging the students to think about what is out there, what is not, and the kinds of values, systems, and resources that go into deciding what is collected and what is not. For homework, the students have to brainstorm some datasets that don’t exist that they are interested in. I’m, frankly, a little suspicious of how successful they will be at this. It seems very hard to me! But I wanted to throw them the challenge to see how they do. Even if they struggle mightily I think it will be worthwhile.
If I had to run this particular class again, I would probably divide the class into smaller groups to help facilitate discussion. We ultimately got where I wanted to go, but my sense was that the students might have needed a bit more help from me structuring the discussion to get there. I think that might have been accomplished by flipping the format: rather than a group discussion about particular topics and images, I would give each group a set of images and a set of questions to answer about them before we came back to discuss. I’m still finding my way in this class of all STEM students since I’m used to exclusively teaching humanities majors.
Overall, the class discussion sent the students into something of an existential tailspin. Is it data? Yes. But also no. Could be! Depends on how much work you want to put into the question. And much more of that work is to come.