Synthetic Journal Articles Could Drive Real Research Progress

May 7, 2019 | Simon Smith

This February, you may have missed a shocking discovery. Dr. Jorge Pérez from the University of La Paz was exploring the Andes Mountains. There, he and his colleagues found a small valley with no humans or animals. Venturing deeper, they came across a species unknown to science: unicorns. ("Ovid's Unicorns," to be specific.)

This “news” story might sound ridiculous. But given its origins, it's surprising how ridiculous it isn't. A machine wrote it after reading millions of web pages and modeling their language. Called GPT-2, it can write long stretches of coherent, human-like prose on any given topic. It's so good that its creator, OpenAI, won't release the full model, fearing an explosion of bot-generated fake news indiscernible from real information. Instead, they released a smaller (kind of lobotomized) version that anyone can try.

Many humans have since written much about GPT-2. Some have experimented with the scaled-down version and posted their results. But I haven't seen anyone discuss the implications for academic publishing or scientific research. And these could be as big as, or bigger than, the implications for news.

How a Machine Learns to Write

But first: How does GPT-2 work?

In a nutshell, it writes by probability. From reading millions of documents, it learns which words likely follow others. Let's say it reads “Jane walked her dog” and “Kate walked her dog,” for example. If you then give it the prompt “Sally walked,” it will likely write “her dog.”
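To make that concrete, here's a minimal sketch of the idea in Python, using nothing but word counts. (GPT-2 itself uses a neural network over subword tokens, not raw counts, but the principle of "predict the likeliest next word" is the same.)

```python
from collections import Counter, defaultdict

# Toy corpus: the model only "knows" what it has read.
corpus = ["Jane walked her dog", "Kate walked her dog"]

# Count which word follows each word (a bigram model).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequently observed word after `word`, if any."""
    counts = follow_counts[word]
    return counts.most_common(1)[0][0] if counts else None

# Complete the prompt "Sally walked" one word at a time.
word = "walked"
for _ in range(2):
    word = predict_next(word)
    print(word)  # prints "her", then "dog"
```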

That part isn't new. Researchers have long used recurrent neural networks to do this. But their implementations had two key limitations: architecture and data.

First, architecture. Simpler recurrent neural networks don't understand context. Give them the prompt "Sally walked," and they might write "her dog." But give them "The cat named Sally walked," and they might write the same thing. They also take a limited window of text as input. Your story might start with "Sally walked," and you might get "her dog" next. But as the generated text grows, watch out: there will be no consistency from beginning to end. It's like a writer who keeps forgetting what they wrote before, and so goes on constant tangents.

Second, data. Researchers tended to train their language models on small datasets, in part due to cost. Or they would use easy-to-access but homogeneous datasets such as Wikipedia. The resulting language models were narrow in knowledge and limited in output.

GPT-2 addresses both of these deficiencies. It uses an architecture called a transformer to maintain context over long stretches of text. And it learned from a diverse set of 8 million web pages, producing one of the largest language models ever. The result: a tool that, given a simple prompt, can write a creative, realistic narrative.
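If you want to try the released small model yourself, here's one way to prompt it. Note that this sketch uses the third-party Hugging Face transformers library rather than OpenAI's own release code, and the sampling settings are just reasonable defaults:

```python
# Prompting the released small GPT-2 model via the Hugging Face
# "transformers" library (pip install transformers torch).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # the small public model
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = ("Researchers have discovered a new genetic variant "
          "associated with Alzheimer's disease.")
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation; top-k sampling keeps output varied but coherent.
output = model.generate(
    input_ids,
    max_length=200,
    do_sample=True,
    top_k=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```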

Warning: Fake Alzheimer’s News Ahead

For example, I fed the small, lobotomized model this prompt: “Researchers have discovered a new genetic variant associated with Alzheimer's disease.” Here’s what it wrote (in a few seconds):

Researchers have discovered a new genetic variant associated with Alzheimer's disease.

The findings could affect many other diseases in the family.

Previous research has suggested that an abnormally high risk for Alzheimer's disease is caused by a genetic mutation that appears to be related to a particular gene involved in controlling blood flow, inactivating certain gene molecules that regulate the immune system.

In a research published Tuesday in Nature Medicine, researchers sequenced and studied three genetic variants associated with Alzheimer's, and found that a single mutation is associated with the same gene. And they found that even if the mutation was a genetic mutation, it is not the cause of the disease. In other words, the genetic variance of the Alzheimer's strain did not create a genetic variant in the one who had been exposed to the mutation, researchers found.

"Our findings show how the Alzheimer's disease genetic test could be used to determine whether or not a mutation is responsible for the disease," said lead author Dr. Yili Lu, a neuroscientist at the University of California, San Francisco. She led the study through the lab of Harvard University's director of genetics, Nicholas J. O'Malley, in collaboration with Dr. Nils-Christian Stojberg, an associate professor of genetics and director of the Center for Experimental Biology at Harvard University.

The study was conducted at the American Institute for Human Research and conducted at the University of California, Berkeley, through the National Institutes of Health, the National Institute on Drug Abuse, and Genentech Technologies, which is a joint venture between the US National Institutes of Health's National Biomedical Research Council and Genentech.

The study, which included an analysis of gene activity and gene expression in nearly 1,000 cases, found that the two genes were associated with a different genetic phenotype.

Genomic abnormalities in Alzheimer's disease can occur when a mutation leads to an altered expression of the immune system, resulting in a state known as a genetic syndrome. It is the same disease where many more people develop Alzheimer's disease due to certain genetic defects, such as an imbalance in the gene responsible for the inflammation-prone disorder in Alzheimer's disease. For many people with an increased risk of developing an autosomal dominant mutation.

Previous studies have found that a particular gene involved in regulating blood flow, called ALDH 1A, is present in just a handful of Alzheimer's gene variants that, in combination with the increased risk of brain dysfunction, can cause an increase in blood flow.

Is this perfect? No. But remember: this is the small model. It hasn't had any additional training on biomedical documents. Yet it manages to generate a plausible story. It even suggested a candidate gene, ALDH 1A, as involved in Alzheimer's. And studies have indeed linked ALDH to Alzheimer's.

Even Worse than Fake News: Fake Research Papers

This of course raises concerns. Imagine websites filled with fake, bot-generated biomedical news that looks real. It's hard enough to keep up with twists and turns in actual scientific research. (Are eggs, coffee, and fat good for you this week, or bad?)

But the implications for academic publishing go beyond fake news.

Let's start with one major negative implication: fake research papers. It's possible to expand GPT-2's education to biomedicine. You could take the base model and fine-tune it on millions of biomedical papers. GPT-2 could then write on biomedical topics in the style of published research. With enough training data, it's likely only expert peer reviewers could distinguish fakes. (Using generative adversarial networks, you could also generate fake figures for fake papers.)
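For the curious, here's a rough sketch of what that fine-tuning might look like. Everything specific in it, from the two-sentence corpus to the hyperparameters to the output name, is a stand-in; a real attempt would need millions of papers and far more compute:

```python
# Hypothetical sketch of fine-tuning GPT-2 on biomedical text with the
# Hugging Face "transformers" library.
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

corpus = [  # stand-in for a large corpus of paper abstracts or full texts
    "Amyloid-beta aggregation is a pathological hallmark of Alzheimer's disease.",
    "We report a genome-wide association study of late-onset dementia.",
]

optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for text in corpus:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=512)
        # For language modeling, the model predicts each next token,
        # so the labels are simply the inputs themselves.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-biomedical")  # hypothetical output name
tokenizer.save_pretrained("gpt2-biomedical")
```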

Why might anyone do this? I can think of a few reasons. One might be to create confusion about a controversial topic. Imagine, for example, a thousand fake papers contradicting climate change findings. Another might be to build a publication history for a real or fictitious person. Think peer review would catch them? Don't count on it. Sting operations have repeatedly shown that some journals will accept obviously flawed papers, so many fakes would likely slip through to publication.

It’s Not All Bad: Meet Your New Writing Assistant

So GPT-2 could clog the world with boatloads of fabricated research. But its potential isn't all negative.

Think, for example, of summarizing existing research. Since GPT-2 simply continues whatever text you feed it, you can coax it into summarizing: OpenAI found that appending "TL;DR:" to a passage prompts the model to produce a summary. So you could train it on text specific to your research and get a useful summary, which you could then insert into a paper. Researchers have begun to use AI in this way already. Recently, for example, they used AI to boil down 1,086 papers into a 250-page book.
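Concretely, the summarization trick is just another prompt. Here's a sketch using the same third-party library as above, with a stand-in passage in place of your own text:

```python
# The "TL;DR:" prompting trick for summarization.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

passage = ("Researchers sequenced three genetic variants associated with "
           "Alzheimer's disease and examined gene expression in nearly "
           "1,000 cases.")  # stand-in; in practice a full abstract or section
prompt = passage + "\nTL;DR:"

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=input_ids.shape[1] + 60,  # leave room for a short summary
    do_sample=True,
    top_k=40,
    pad_token_id=tokenizer.eos_token_id,
)
# Print only the text generated after the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:],
                       skip_special_tokens=True))
```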

Even better, you could call up specific summaries in context. Use different prompts, and you will get different outputs. Want to write about the history of your subject? Use a prompt such as, "The history of this field begins..." Want to write about new findings? Use a prompt such as, "New findings in this field include..." You will get useful output for each that you can then incorporate into a paper.

Then there are more speculative possibilities. Might GPT-2 generate novel hypotheses, for example? In the output above, it links ALDH to Alzheimer's. This is because within the language model, there's a certain probability they're related. If you further trained GPT-2 on biomedical text, it would learn new associations. Some of these associations might be novel. And you might be able to probe them with specific prompts. For example, "Neuroinflammation is an important contributor to Alzheimer's disease pathogenesis, as underscored by..."

Given the growing sophistication of language models, I'm sure there are many more uses. If you think of any, let me know in the comments. If not, I can always prompt GPT-2 to do so. "A great use of GPT-2 for academic publishing is..."

About the Author

Simon Smith

Simon Smith is Chief Growth Officer of BenchSci, which helps biomedical researchers run more successful experiments by using AI to select antibodies. On the BenchSci blog, he writes more about AI in drug discovery.
