Are you sure you're reading news?
How AI can tell if an article is news or advertisement
Do you know when you're reading an advertisement and when you're reading a news article? One study has shown that only 7% of people can distinguish advertorial from editorial content. But fear not! Timo Kats and Peter van der Putten have found a solution: their AI can distinguish the two types of content 90% of the time.
Timo Kats: “The outcome of the research was really a lot of percentages. Those are hard to picture. So I made two things so that people can interpret all these digits better: a lexicon and a word web. It’s nice to see the story behind the mountain of numbers that produced my paper.”
About the researchers
Timo Kats graduated in Computer Science and Economics at the University of Leiden. For his graduation thesis, he investigated the possibility of using AI to distinguish advertorial from editorial content. On November 11, 2021, he presented his thesis as a paper at the BNAIC/BENELEARN conference in Luxembourg.
Peter van der Putten is an Assistant Professor at LIACS, the Leiden Institute of Advanced Computer Science at the University of Leiden. He is both a researcher and a teacher, and he coached Timo during his graduation project.
Jasper Schelling is the initiator of the Reverb Channel programme at ACED. With Reverb Channel, he makes sure that researchers have the right conditions for their research, for example by gathering the right data, providing the technical resources and connecting the right people.
Timo's research question was: to what extent can we differentiate commercial and editorial content by using machine learning?
His first step was to gather information. Timo: “This was quite hard to do. I wanted to gather information from different newspapers, because I didn’t want a biased model. The problem is that lots of Dutch newspapers don’t use advertorials. Some newspapers were technically very hard to scrape. So I ended up with Nu.nl, NRC, Telegraaf and the Ondernemer, which is a bit of an entrepreneurial medium where you have quite a lot of advertorials.”
These four websites were then scraped. Scraping is a method of gathering information where a web crawler collects specific data from the web. Web crawlers are internet bots that browse the World Wide Web systematically. The information was stored in a central database, where it was later analysed.
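The scraping step described above can be sketched roughly as follows. This is a minimal illustration, not Timo's actual scraper: the article does not say which tools he used, so the sample page, the `articles` table and the helper names `extract_text` and `store_article` are all assumptions. To keep the sketch runnable offline, it parses a made-up HTML string instead of downloading a real page.

```python
import sqlite3
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Only keep text that sits inside a paragraph; menus etc. are skipped.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_text(html):
    """Return the paragraph text of a page as one string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(p.strip() for p in parser.paragraphs if p.strip())


def store_article(connection, source, label, text):
    """Save one scraped article in the central database (hypothetical schema)."""
    connection.execute(
        "INSERT INTO articles (source, label, body) VALUES (?, ?, ?)",
        (source, label, text),
    )


# Made-up page; a real crawler would fetch this over HTTP from a news site.
SAMPLE_HTML = """
<html><body>
<h1>Headline</h1>
<p>The cabinet announced new measures today.</p>
<div class="menu">Home | Sport</div>
<p>Parliament will vote next week.</p>
</body></html>
"""

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (source TEXT, label TEXT, body TEXT)")
store_article(connection, "example-site", "editorial", extract_text(SAMPLE_HTML))
rows = connection.execute("SELECT source, label, body FROM articles").fetchall()
```

Each stored row carries a label (advertorial or editorial), which is what a classifier would later learn from.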
Timo scraped 2,000 articles in total: 1,000 news articles and 1,000 advertorials. Timo: “I received some criticism that my research did not have enough data. 2,000 articles may sound like a lot, but it isn’t. The corpus of Reverb Channel has more than a million articles. But they probably have a server somewhere to scrape the articles. I only had my laptop.”
Gathering news articles was fairly easy. Gathering enough advertorials was more of a challenge, because they disappear after a while. Peter: “The information we gathered was enough for a bachelor thesis. We showed that we can validate the results and that we learned something from it. Our goal was to gain insight into how well you can distinguish between the two content types and to see what drove the prediction. And we wanted to show the world. That’s why we didn’t just produce a thesis, but a scientific paper as well. Making a paper out of a bachelor thesis is truly remarkable.”
This is the way
The next step was to find an accurate machine learning algorithm that could perform the task of distinguishing between the two types of content. One can find a lot of existing algorithms on the internet. The key is to find the one that works best. And that meant: testing. A lot of testing. Timo: “I found a couple of algorithms that worked, so I tweaked some parameters. I then just started ruling out the bad ones and I got one with really good results.”
The training of the models went as follows: Timo took all 2,000 articles and ‘threw them onto a big pile’. He then used 90 percent of those articles to train the models. To check whether the models worked, he used the other 10 percent of the articles. On that 10 percent, he got a 90 percent accuracy.
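The train/test split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus consists of placeholder (id, label) pairs rather than real articles, and the function names, the random seed and the shuffling approach are hypothetical, not taken from the research. With 2,000 articles, a 90/10 split leaves 1,800 for training and 200 for testing.

```python
import random


def split_corpus(articles, train_fraction=0.9, seed=42):
    """Shuffle the articles ('one big pile') and split them into train and test sets."""
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Placeholder corpus: 1,000 advertorials and 1,000 editorials as (id, label) pairs.
corpus = [(i, "advertorial") for i in range(1000)] + [
    (i, "editorial") for i in range(1000, 2000)
]
train_set, test_set = split_corpus(corpus)
```

A model is fitted on `train_set` only, and `accuracy` is then computed on its predictions for `test_set`, so the score reflects articles the model has never seen.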
Webs and words and webs and words
Timo has made a lexicon and a word web to visualise which words are used more often in advertorials and which ones in editorials. Peter: “It’s interesting, because they aren’t just the subjects of an article. It’s also about the way it’s written about. ‘Free’ and ‘inspiring’ are both very popular in advertisements. And yes, ‘kabinet’ is indeed often found in news articles. But so are the four w’s: what, when, where, why. That is really interesting.”
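One way such a lexicon could be built is by comparing how often each word occurs in the two kinds of text. The sketch below uses a smoothed log-odds score, which is an assumption: the article does not say how Timo's lexicon was computed, and the example sentences are invented. A positive score means a word leans towards advertorials, a negative score towards editorials.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_counts(documents):
    """Count how often each word occurs across a list of documents."""
    counts = Counter()
    for document in documents:
        counts.update(tokenize(document))
    return counts


def build_lexicon(advertorials, editorials):
    """Score each word: positive leans advertorial, negative leans editorial."""
    ad_counts = word_counts(advertorials)
    ed_counts = word_counts(editorials)
    vocabulary = set(ad_counts) | set(ed_counts)
    # Add-one smoothing so words missing from one class do not divide by zero.
    ad_total = sum(ad_counts.values()) + len(vocabulary)
    ed_total = sum(ed_counts.values()) + len(vocabulary)
    lexicon = {}
    for word in vocabulary:
        ad_rate = (ad_counts[word] + 1) / ad_total
        ed_rate = (ed_counts[word] + 1) / ed_total
        lexicon[word] = math.log(ad_rate / ed_rate)
    return lexicon


# Invented toy sentences, only to show the mechanics.
advertorials = ["Try it free today", "An inspiring and free workshop"]
editorials = ["The kabinet decides when and where", "Why the kabinet waits"]
lexicon = build_lexicon(advertorials, editorials)
ranked = sorted(lexicon, key=lexicon.get, reverse=True)
```

Sorting the words by score gives the two ends of the lexicon: the most advertorial-like words first and the most editorial-like words last.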
The articles of the Ondernemer were the hardest to differentiate. Peter: “I don’t think that’s evil intent, but more the nature of the beast. In a business to business magazine, it gets intermingled more easily.”
With the word web and the lexicon, Timo has already made a few applications of his research. There are other possibilities to put his research into practice. Timo: “Well, I think that this project is at its end. Since this was my bachelor thesis and I have since graduated, I look forward to other projects.”
And Reverb Channel? They will include this project in their selection of research. Jasper: “The technologies that Timo has used can be used by news organizations to recommend articles. Like Spotify’s Discover Weekly, but for news.
“I believe that the best way to use technology like this is to amplify human abilities, not to replace them. But how we could do that with Timo’s research, I don’t know. Maybe it’s about which articles you did or didn’t read, or about reading more critically. Those are things that you can try. You don’t really want to see the technologies as a solution, but more as a way to create realistic expectations of what does and does not work. The question is: do people want help with that? Does our society benefit from having lots of critical news readers?
“Our short-term goal is to build an infrastructure with Reverb Channel for experiments that can answer these questions, so that, in the long term, we can attract more and more researchers, whether those are graduate students, PhD researchers or people with a specific interest in these topics. In a way, Reverb Channel is like a playground. We have to set up a place in which research can be done. After we have done extensive research, we can look at applications.”