Scientists have developed an algorithm which they claim can predict the commercial success of a book with 84 per cent accuracy.
Using a technique called ‘statistical stylometry’, which may appeal more to maths nerds than to literary ones, the scientists sought to determine what connection, if any, exists between writing style and successful literature, based on a range of factors.
Looking at novels from several different genres, the computer scientists from Stony Brook University in New York found that there were distinct linguistic patterns shared among successful literature from the same genre, making it possible to build a model with ‘surprisingly high accuracy’.
Less successful books appear to rely on verbs that explicitly describe actions and emotions, such as ‘wanted’, ‘took’, ‘promised’, ‘cried’ and ‘cheered’, while more successful books used verbs that describe thought processes, such as ‘recognised’ and ‘remembered’, and simple verbs that introduce quotations, such as ‘said’ and ‘say’.
More successful books used the discourse connectives ‘and’, ‘which’, ‘that’ and ‘as’ more frequently, while less successful books used words that many would associate with clichés, such as ‘breathless’, ‘beach’ and ‘perfectly’.
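As a rough illustration of the kind of word-level signal described above, the Python sketch below measures how often each group of words appears in a passage. It is not the researchers' model: the group names, the `group_rates` function and the sample sentence are all assumptions made for this example, and the word lists contain only the examples quoted in this article.

```python
import re
from collections import Counter

# Word groups quoted in the article. The group names are illustrative
# labels, not terms taken from the study. "recognized" is included as a
# spelling variant of "recognised".
WORD_GROUPS = {
    "thought_verbs": {"recognised", "recognized", "remembered"},
    "quote_verbs": {"said", "say"},
    "action_emotion_verbs": {"wanted", "took", "promised", "cried", "cheered"},
    "connectives": {"and", "which", "that", "as"},
    "cliche_words": {"breathless", "beach", "perfectly"},
}


def group_rates(text):
    """Return each group's share of all word tokens in the text."""
    # Very simple tokeniser: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in WORD_GROUPS}
    return {
        name: sum(counts[word] for word in words) / total
        for name, words in WORD_GROUPS.items()
    }


# Hypothetical sample sentence, written for this example only.
sample = "She remembered the beach and said that it was perfectly still."
rates = group_rates(sample)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

A real stylometric model would combine many such frequencies across a whole text and learn from data which ones matter; this sketch only shows how a single text can be turned into numbers.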
Success was determined by using the download count from the Project Gutenberg archive, which gives users access to 40,000 classic copyright-free novels. The researchers included books across several categories, including adventure, classic literature, poetry and science fiction.
The first sentence from 800 separate novels was used to complete the study. These sentences were then analysed with techniques used for authorship attribution, genre detection, gender identification and native language detection. For a small number of novels, awards such as Pulitzer and Nobel Prizes were also considered. Amazon sales were also used to help determine success.
The researchers made sure that no single author had more than two books in the dataset, in order to eliminate bias from a few successful authors and ensure that they examined overall linguistic patterns.
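The two-books-per-author limit can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the article does not say how the researchers chose which titles to keep, so the hypothetical `cap_books_per_author` function below simply keeps the first ones it encounters, and the author and title names are placeholders.

```python
from collections import defaultdict


def cap_books_per_author(books, max_per_author=2):
    """Keep at most max_per_author books per author, in input order.

    books is a list of (author, title) pairs. Keeping the first titles
    seen is an assumption of this sketch, not the study's stated rule.
    """
    kept = []
    seen = defaultdict(int)  # number of books already kept per author
    for author, title in books:
        if seen[author] < max_per_author:
            seen[author] += 1
            kept.append((author, title))
    return kept


# Placeholder catalogue, invented for this example.
catalogue = [
    ("Author A", "Book 1"),
    ("Author A", "Book 2"),
    ("Author A", "Book 3"),
    ("Author B", "Book 4"),
]
capped = cap_books_per_author(catalogue)
print(capped)
```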
Some books not included in Project Gutenberg were deliberately added to the study, such as The Old Man and the Sea by Ernest Hemingway, which was chosen for Hemingway’s typically minimalist style, and The Lost Symbol by Dan Brown, which was selected because it was a commercially successful book that was definitely not a hit with critics.
Co-author of the study, Stony Brook Department of Computer Science Assistant Professor Yejin Choi, told the Stony Brook Newsroom that to the best of their knowledge this was the first quantitative research that provides insights into the connection between writing style and success.
‘Previous work has attempted to gain insights into the "secret recipe" of successful books. But most of these studies were qualitative, based on a dozen books, and focused primarily on high-level content – the personalities of protagonists and antagonists and the plots. Our work examines a considerably larger collection – 800 books – over multiple genres, providing insights into lexical, syntactic, and discourse patterns that characterize the writing styles commonly shared among the successful literature,’ Choi said.
‘It sets forth an understanding of the connection between successful writing style and readability. We also shed light on the connection between sentiment/connotation and literary success, and put forward comparative insights between successful writing styles of fiction and nonfiction.’
Those who have found themselves struggling to get through critically acclaimed classic literature are not alone, with the study finding that highly successful literary works may require a ‘syntactic complexity’ that works against readability.