Data Collection Methodology
The data collection methodology for this project was fairly involved, mainly due to the nature of the data itself.
Project Domain
I initially wasn’t sure how much data I wanted to scrape. After seeing this Statista article about the most commonly consumed news outlets, I picked the top 10.
These included:
- New York Times
- CNN
- Fox News
- New York Post
- BBC
- Washington Post
- USA Today
- Daily Mail
- CNBC
- The Guardian
I wanted to focus my attention on articles from these publications, specifically from 2007 through 2022. Unfortunately, anything prior to 2007 resulted in sporadic, inconsistent data.
Scraping
Existing APIs for gathering headlines were expensive, confusing, and didn’t offer access to historical headlines. As a result, I settled for scraping Archive.org’s Wayback Machine.
JavaScript :(
Since the data was scraped with plain Python web requests (via the requests library), any site that required JavaScript to load its article content simply wouldn’t get scraped.
This led to a major gap in the CNN data beginning around 2017. An alternative would have been to use Selenium to load the page, grab the page source, and feed that into BeautifulSoup4. I opted to just accept the data discrepancy, purely because of time constraints.
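For reference, a rough sketch of the Selenium route I decided against might have looked something like this (the snapshot URL, the wait, and the driver setup are illustrative, not code from the project):

```python
# Sketch of the Selenium fallback I skipped: render the page in a real
# (headless) browser so JavaScript runs, then hand the HTML to BeautifulSoup.
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)

# Placeholder URL: any Wayback Machine capture of cnn.com would do here.
driver.get("https://web.archive.org/web/20170101000000/https://www.cnn.com/")
time.sleep(5)  # crude wait for client-side rendering to finish

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(len(soup.find_all(["h1", "h2"])))  # candidate headline tags
```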
Retrieving URLs
With the Wayback Machine API, I could provide a date in YYYYMMDD format and receive a link to the closest snapshot to that date. With this in mind, I generated a range of dates from 20070101 to 20221231 and fed each one into the Archive.org API.
Naturally, some dates had to be skipped because no snapshot was available. Nowadays everything gets archived, but toward the beginning of the dataset there were a lot of gaps.
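A minimal sketch of that lookup, using the Wayback Machine’s availability endpoint (the domain and loop bounds here are illustrative; my actual script differed in the details):

```python
# Sketch: walk every date from 2007-01-01 to 2022-12-31 and ask the Wayback
# Machine's availability API for the snapshot closest to that date.
from datetime import date, timedelta

import requests

AVAILABILITY_API = "https://archive.org/wayback/available"

def snapshot_url(domain: str, day: date) -> str | None:
    """Return the closest snapshot URL for `domain` on `day`, or None if missing."""
    params = {"url": domain, "timestamp": day.strftime("%Y%m%d")}
    resp = requests.get(AVAILABILITY_API, params=params, timeout=30)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

day = date(2007, 1, 1)
snapshots = []
while day <= date(2022, 12, 31):
    url = snapshot_url("cnn.com", day)  # illustrative domain
    if url is not None:                 # dates with no capture simply get skipped
        snapshots.append((day, url))
    day += timedelta(days=1)
```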
Scraping each frontpage
This was the part I got incredibly lost with. Initially, I searched for certain CSS classes that indicated an HTML element was a headline. This worked well for the more modern pages, but I quickly learned an important lesson: websites evolve. A lot.
As I kept patching up issues with the code, I realized that the web design was never consistent. I can’t blame them; my web design is never consistent either.
I had to switch to a new approach: gathering every h1 and h2 tag, then filtering out the non-headlines.
Thankfully, this was much more straightforward. Every website used some kind of HTML heading tag for its headlines, and I could use BS4 to traverse the DOM and try to find a URL matching each headline. Category links from the sites’ navigation bars still slipped through, so I filtered those out by excluding anything under three words, as sketched below.
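Boiled down, the heading-tag approach looked something like this (a simplified sketch; the real traversal and filtering had more special cases):

```python
# Sketch of the heading-based extraction: collect every h1/h2, look for an
# associated link, and drop short nav/category entries.
from bs4 import BeautifulSoup

def extract_headlines(html: str) -> list[tuple[str, str | None]]:
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for tag in soup.find_all(["h1", "h2"]):
        text = tag.get_text(strip=True)
        if len(text.split()) < 3:        # nav/category links are short -- skip them
            continue
        # The headline's URL is usually an <a> nested inside or wrapping the heading.
        link = tag.find("a") or tag.find_parent("a")
        href = link.get("href") if link else None
        headlines.append((text, href))
    return headlines
```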
Running the scraper
By this point I had spent about two weeks working on this scraper, on and off. I wasn’t sure how long it would take to run, but I figured it would be a while. To prepare, I made sure to skip days that were already logged, just in case my script failed at some point. Good call: it did fail.
After spending a few extra hours wrapping everything in try/except blocks, fixing the DOM traversal, and generally trying not to get banned from Archive.org, I finally got things running smoothly.
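The skeleton of the run loop, with the skip-if-already-saved check and the blanket try/except, was roughly the following. It reuses the snapshot_url and extract_headlines helpers sketched above; the file layout, site list, and delay are illustrative:

```python
# Sketch of the resumable run loop: skip days already saved to disk, and never
# let one bad page kill a multi-day run.
import json
import time
from datetime import date, timedelta
from pathlib import Path

import requests

SITES = ["nytimes.com", "cnn.com", "foxnews.com"]  # ...and the rest of the ten
OUT_DIR = Path("scraped")
OUT_DIR.mkdir(exist_ok=True)

day = date(2007, 1, 1)
while day <= date(2022, 12, 31):
    for site in SITES:
        out_file = OUT_DIR / f"{site}_{day:%Y%m%d}.json"
        if out_file.exists():
            continue  # already scraped on a previous run
        try:
            url = snapshot_url(site, day)  # availability lookup from earlier
            if url is not None:
                html = requests.get(url, timeout=60).text
                out_file.write_text(json.dumps(extract_headlines(html)))
        except Exception as exc:  # broad on purpose: log it and move on
            print(f"{site} {day}: {exc}")
        time.sleep(1)  # stay gentle with Archive.org
    day += timedelta(days=1)
```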
Doing the math, it would take around two days straight for the script to run: 50,000+ calls to the Archive.org API, one for each of 50,000+ individual webpages, and at an average of three to four seconds per page that works out to roughly 48 hours. The major roadblock was loading those pages, which could take upward of 10 seconds each.
I tried writing code to force pages into the Archive.org cache: I would make a request to the website but cancel it immediately afterwards, hoping Archive.org would keep the page cached for the next 15 minutes or so. Looking back, this was probably not a great idea. It never worked, and probably just added extra load on Archive.org’s servers.
Resigned to the inevitable two-day wait, I carried on, running the script and checking on it every hour to make sure nothing went wrong.
Getting blocked by Archive.org
The inevitable happened: I got rate limited by Archive.org. No requests from my IP would go through, yet requests over a free public VPN would.
I was getting HTTP 429 errors, which eventually subsided. My anxiety about the project was at an all-time high, but thanks to the “don’t request if I already saved this day” bit of code I had written, rerunning the script was no issue.
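For the record, a gentler pattern would have been to back off automatically on a 429 instead of waiting out the block and rerunning by hand; something along these lines (not what my script actually did):

```python
# Not what my script did -- a retry-with-backoff pattern that would have
# handled the 429s more gracefully than waiting and rerunning manually.
import time

import requests

def get_with_backoff(url: str, max_tries: int = 5) -> requests.Response:
    delay = 60  # start with a minute; double it on every 429
    for _ in range(max_tries):
        resp = requests.get(url, timeout=60)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)
        delay *= 2
    resp.raise_for_status()  # still rate limited after all retries
    return resp
```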
Completion! (of scraping)
Eventually, after a long and arduous wait, the data had finished scraping. I couldn’t even open the file at first; it would immediately bring VSCode to a crawl.
When I finally got the file open, I did a quick skim through, and it turned out I had 4.5 million headlines in total. Perfect!
I wrote a quick Python script to generate an availability.csv file, which simply listed the dates I had been able to collect data for. Importing it into Google Sheets immediately showed the gaps and discrepancies in the data, which made for a nice visualization.
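The availability script was nothing fancy; assuming the one-file-per-site-per-day layout sketched earlier, it amounted to something like:

```python
# Sketch of the availability report: one row per (date, publication) pair that
# actually has a file on disk, dumped to availability.csv for Google Sheets.
import csv
from pathlib import Path

rows = []
for f in sorted(Path("scraped").glob("*.json")):
    site, day = f.stem.rsplit("_", 1)  # e.g. "cnn.com_20120315"
    rows.append({"date": day, "publication": site})

with open("availability.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["date", "publication"])
    writer.writeheader()
    writer.writerows(rows)
```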
I posted the data to Kaggle, and that was that: 4.5 million headlines scraped. But this was just the start.
Figuring out what to do with all this data
My initial idea for this project was to run the data through a bias detector and graph political polarization over time. For now, I stuck to a simple bias and sentiment detector.
The bias detector was a RoBERTa model fine-tuned on the BABE dataset. The sentiment detector was pulled straight from Hugging Face.
Although this is a brief section, this part of the project took the longest. I probably trained upwards of 30 models in this timeframe, and I spent upwards of a month learning about artificial intelligence, how to train natural language processing models, and how to organize training data.
I messed around with numerous datasets: the Manifesto Project, custom hand-labeled data, BABE/MBIC, and so on. Eventually I settled on BABE/MBIC, simply because I wanted a baseline bias detector, nothing too fancy.
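A heavily simplified sketch of that kind of fine-tuning run is below, assuming BABE/MBIC is available locally as a CSV of sentences with binary bias labels. The file name, column names, and hyperparameters are placeholders, not my exact training setup:

```python
# Simplified sketch: fine-tune RoBERTa as a binary bias classifier on
# BABE-style sentence/label pairs.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Assumes a local CSV with "text" and "label" (0 = unbiased, 1 = biased) columns.
dataset = load_dataset("csv", data_files="babe.csv")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bias-model",
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
Trainer(model=model, args=args, train_dataset=dataset).train()
```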
Model Inferencing
After creating a bias detection model and finding a decent sentiment analysis model, I selected a subset of the data to analyze: five publications from 2012-2022, with about 100,000 headlines from each year. I hoped that this way the data would be evenly distributed.
Once I had separated out the data I wanted to analyze, it was time to run the models over it. Unfortunately, even with Kaggle’s and Google Colab’s free tiers, analyzing everything would have exceeded the session time limit.
So, I had two options. I could spend an hour or two splitting the data into batches and then running inference across multiple platforms and multiple sessions, for a total of about four hours.
Instead, I went with the other option: spending six hours learning how to run Hugging Face transformer models on Paperspace’s Graphcore IPUs. It paid off; it was incredibly satisfying to see the timer show ‘45 minutes ETA’ instead of ‘12 hours ETA’.
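The IPU-specific Paperspace/Graphcore setup is beyond the scope of this post, but the plain Hugging Face version of the same batched inference step looks roughly like this (file paths, model path, and column names are placeholders):

```python
# Plain Hugging Face version of the batched inference step (the IPU-specific
# Graphcore setup is omitted; "bias-model" is a placeholder path).
import pandas as pd
from transformers import pipeline

headlines = pd.read_csv("selected_headlines.csv")  # the 2012-2022 subset

bias = pipeline("text-classification", model="bias-model")
sentiment = pipeline("sentiment-analysis")  # default Hugging Face sentiment model

# Batching keeps the accelerator busy instead of feeding one headline at a time.
texts = headlines["headline"].tolist()
headlines["bias"] = [r["label"] for r in bias(texts, batch_size=64, truncation=True)]
headlines["sentiment"] = [r["label"] for r in sentiment(texts, batch_size=64, truncation=True)]

headlines.to_csv("scored_headlines.csv", index=False)
```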
After running the models on every headline I had selected, it was time to analyze the results.
Result Analysis
Summary Generation
To make this easier, I wrote code that aggregated each day’s totals into a separate dataframe, logging one row per day per news publication: 365 days a year × 10 publications × 10 years = 36,500 rows of data.
While generating the summaries, I also counted certain keywords, as mentioned earlier, and calculated per-day percentages.
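The aggregation itself was essentially a pandas groupby; roughly the following (column names and label strings are illustrative, matching the scored output sketched above):

```python
# Sketch of the summary step: one row per (date, publication) with headline
# counts and the share of headlines flagged as biased / negative that day.
import pandas as pd

scored = pd.read_csv("scored_headlines.csv")  # per-headline bias + sentiment labels

daily = (
    scored.groupby(["date", "publication"])
    .agg(
        headlines=("headline", "count"),
        pct_biased=("bias", lambda s: (s == "BIASED").mean() * 100),
        pct_negative=("sentiment", lambda s: (s == "NEGATIVE").mean() * 100),
    )
    .reset_index()
)

daily.to_csv("daily_summary.csv", index=False)
```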
I ran the code, generated the summaries, and finally had them all in one neat folder. It was time to try graphing them.
Graphing
To graph the data, I initially just wrote a few lines of Matplotlib code. I liked the way those charts looked; they showed the general trend of media bias increasing over time.
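The Matplotlib version was only a handful of lines; something like this, reading the daily summaries and plotting yearly averages per publication (column names match the illustrative summary above):

```python
# Quick Matplotlib version of the trend chart: average bias percentage per
# publication per year, drawn as one line per outlet.
import matplotlib.pyplot as plt
import pandas as pd

daily = pd.read_csv("daily_summary.csv")
daily["year"] = daily["date"].astype(str).str[:4].astype(int)

for pub, group in daily.groupby("publication"):
    yearly = group.groupby("year")["pct_biased"].mean()
    plt.plot(yearly.index, yearly.values, label=pub)

plt.xlabel("Year")
plt.ylabel("Headlines classified as biased (%)")
plt.legend()
plt.show()
```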
To present this data, I obviously could have just uploaded screenshots to the website, but I wanted some degree of interactivity. So I ended up spending a day or two learning Chart.js, and the result is the graphs you see on the analysis page.
Conclusion
I learned a lot from this project, both on the software development side and the AI side: training AI models, handling data at scale, and using graphing libraries for the web. But I think the most important part is that I discovered my passion for combining computer science with investigating issues in the real world. Data science is an incredibly complex field, yet seeing the results of my data applied to real-world issues is deeply satisfying. Hopefully this data can help others in the future too; I’ve published it on Kaggle. Check the data page for more info.