Links
Overview
Collection
The data was collected by Jordan Krishnayah over the summer using custom scraping code. The scraped headlines consist of roughly 4.5 million lines from 2007-2022, and the analyzed data consists of roughly 1.3 million lines from 2012-2022.
This data isn’t perfect: there are gaps and discrepancies, largely because of the way the dataset was scraped.
For more info on the methodology, and precisely how the data was collected, see the methodology page.
Metrics
A few metrics were collected for the data:
AI/NLP models labeled each headline for sentiment and bias.
I also tracked keywords, such as:
- “Trump”
- “Biden”
- “Obama”
- “Clinton”
The full CSV heading for the analyzed data is:
Day,Headlines,Biased,Not Biased,Positive,Neutral,Negative,keyword_trump,keyword_obama,
keyword_biden,keyword_clinton,keyword_republican,keyword_democrat,keyword_china,
keyword_russia,keyword_north korea,keyword_lgbt,keyword_gay,keyword_climate,keyword_nuclear,
keyword_congress,keyword_house,keyword_senate,keyword_supreme court,keyword_slams,keyword_blasts,
keyword_liberal,keyword_conservative,Percentage Negative,Percentage Neutral,Percentage Positive,
Percentage Biased,Percentage Not Biased,Negative to Positive Ratio,Biased to Not Biased Ratio,
Trump to Headline Ratio,Clinton to Headline Ratio
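As a rough illustration of working with these columns, here is a minimal sketch that reads rows by column name with Python's standard `csv` module. The sample rows, the date format, and the values are invented for illustration, and `negative_share` assumes that the negative share is simply `Negative / Headlines`, which this page does not confirm.

```python
import csv
import io

# Hypothetical two-row sample using a subset of the columns above.
# The date format and all values are invented purely for illustration.
SAMPLE = """Day,Headlines,Positive,Neutral,Negative,keyword_trump,keyword_north korea
2016-11-08,200,20,120,60,50,1
2016-11-09,250,25,125,100,100,0
"""

def load_rows(handle):
    """Read the analyzed CSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

def negative_share(row):
    """Fraction of a day's headlines labeled negative.

    Assumption: the share is Negative / Headlines; the dataset's own
    "Percentage Negative" column may be defined differently.
    """
    total = int(row["Headlines"])
    return int(row["Negative"]) / total if total else 0.0

rows = load_rows(io.StringIO(SAMPLE))
for row in rows:
    # Column names containing spaces (e.g. "keyword_north korea") work as dict keys.
    print(row["Day"], round(negative_share(row), 3), row["keyword_north korea"])
```

To use it on the real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="", encoding="utf-8")`.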
Raw Headlines
The raw data consists of four columns: Date, Publication, Headline, and URL.
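A minimal sketch of reading the raw file and counting headlines per publication, assuming the file has a header row with exactly those four column names. The sample rows and URLs below are made up, and `headlines_per_publication` is a hypothetical helper, not part of the original scraping code.

```python
import csv
import io
from collections import Counter

# Invented sample rows in the four-column layout described above.
SAMPLE = """Date,Publication,Headline,URL
2020-01-01,CNN,Example headline one,https://example.com/1
2020-01-01,FOX,Example headline two,https://example.com/2
2020-01-02,CNN,"Example headline, with a comma",https://example.com/3
"""

def headlines_per_publication(handle):
    """Count how many headline rows each publication contributes."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row["Publication"]] += 1
    return counts

counts = headlines_per_publication(io.StringIO(SAMPLE))
for publication, n in counts.most_common():
    print(publication, n)
```

Using the `csv` module rather than splitting on commas matters here, since headlines often contain commas and are quoted in the file.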
Availability
To better understand what data is available and which days haven’t been scraped for which publications, check out availability.csv
The CSV header looks like this:
Date, New York Times, CNN, FOX, New York Post, Washington Post, USA Today, Daily Mail, CNBC
You can also import this CSV file into a spreadsheet and use conditional formatting to highlight what is available and what is not.
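The same check can be sketched in code. This page does not say how each cell marks availability, so the example below assumes that an empty cell, `0`, `false`, or `no` means "not scraped" and anything else means "available". The `is_available` and `missing_days` helpers and the sample rows are hypothetical; adjust the predicate to match the actual file.

```python
import csv
import io

# Invented sample with a subset of the publication columns.
# Cell values are an assumption: 1 = scraped, 0 = not scraped.
SAMPLE = """Date, New York Times, CNN, FOX
2015-03-01, 1, 0, 1
2015-03-02, 1, 1, 0
"""

def is_available(cell):
    """Assumed encoding: empty, 0, false, or no means the day is missing."""
    return cell.strip().lower() not in {"", "0", "false", "no"}

def missing_days(handle):
    """Map each publication to the list of dates with no scraped data."""
    missing = {}
    # skipinitialspace drops the space after each comma in the header and rows.
    reader = csv.DictReader(handle, skipinitialspace=True)
    for row in reader:
        for publication, cell in row.items():
            if publication == "Date":
                continue
            if not is_available(cell or ""):
                missing.setdefault(publication, []).append(row["Date"])
    return missing

missing = missing_days(io.StringIO(SAMPLE))
for publication, days in missing.items():
    print(publication, days)
```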