An Architecture for Data-to-Text Systems

I was very happy to be given a Test-of-Time award at INLG for my 2007 paper An Architecture for Data-to-Text Systems. Test-of-Time awards are given to old papers (in this case, published at least 10 years ago in INLG or ENLG) which have had a major impact and continue to be cited in 2022. This is the first Test of Time award from INLG, so I guess the reviewers must have considered this paper to be top of the list!

I must say it’s been a good year for awards. My former student Meg Mitchell won an ACL Test of Time award for work she did on image captioning while she was a PhD student at Aberdeen, my current PhD student Francesco Moramarco won an award at NAACL 2022 for Best Paper on human-centred NLP (special theme), and now I’ve won the INLG Test of Time award. Not bad!

Anyway, I thought I’d say a bit about the paper and also the context (which a few people have asked me about).

Aberdeen in 2007 was a really exciting place to do NLG research. Our NLG group was led by Chris Mellish, and included myself, Kees van Deemter, and Yaji Sripada (faculty); Graeme Ritchie, Albert Gatt, and Francois Portet (research fellows); and several PhD students including Saad Mahamood, Nava Tintarev, and Ross Turner. I think it was the biggest NLG group in the world at the time.

I myself was a Reader (similar to Associate Professor in the USA) and focused on data-to-text, i.e., building NLG systems which summarised, explained, and otherwise communicated complex numeric and symbolic data sets. We had just started working on the Babytalk project, whose goal was to generate summaries of clinical data from babies in neonatal intensive care, for doctors, nurses, and parents. It was the most ambitious data-to-text project attempted to date, and it involved a research team from many backgrounds (data analysis, knowledge-based reasoning, NLG, medical informatics, psychology, neonatal care). I wanted to come up with an architecture which integrated the many types of reasoning and knowledge needed for complex data-to-text processing. The architecture also needed to cover both data analytics and NLP and be consistent with previous work in this area.

In the paper, I essentially proposed that data-to-text systems be treated as a pipeline of four components, each of which did a different type of processing. Identifying the types of reasoning needed was as important as constructing a pipeline. The stages were signal analysis, data interpretation, document planning, and microplanning and realisation.

The above architecture separates different types of reasoning into different modules, which makes it easier for multidisciplinary teams to collaborate. For example, data scientists can work on signal analysis, subject matter experts can work on data interpretation, and computational linguists can work on microplanning and realisation.
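To make the pipeline idea a bit more concrete, here is a minimal Python sketch of the four stages (signal analysis, data interpretation, document planning, and microplanning and realisation). Everything in it is a toy assumption for illustration only: the `Pattern` and `Message` classes, the threshold rule, the importance score, and the sentence template are hypothetical and are not taken from the paper or from Babytalk. The only point is that each stage is a separate function with its own input and output, so different people can work on different stages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pattern:
    """Output of signal analysis: something detected in the raw numbers."""
    channel: str
    kind: str
    time: int
    value: float


@dataclass
class Message:
    """Output of data interpretation: a domain-level reading of a pattern."""
    channel: str
    event: str
    time: int
    importance: float


def signal_analysis(channel: str, series: List[float], threshold: float) -> List[Pattern]:
    # Toy detector (assumption): any sample above the threshold is a "spike".
    return [Pattern(channel, "spike", t, v)
            for t, v in enumerate(series) if v > threshold]


def data_interpretation(patterns: List[Pattern]) -> List[Message]:
    # Toy domain rule (assumption): a spike is a "high reading", and
    # higher values are treated as more important.
    return [Message(p.channel, "high reading", p.time, p.value) for p in patterns]


def document_planning(messages: List[Message], max_messages: int = 2) -> List[Message]:
    # Select the most important messages, then present them in time order.
    selected = sorted(messages, key=lambda m: m.importance, reverse=True)[:max_messages]
    return sorted(selected, key=lambda m: m.time)


def microplanning_and_realisation(plan: List[Message]) -> str:
    # Toy template-based realiser (assumption): one sentence per message.
    if not plan:
        return "No notable events."
    return " ".join(f"There was a {m.event} in {m.channel} at t={m.time}."
                    for m in plan)


def data_to_text(channel: str, series: List[float], threshold: float) -> str:
    # The pipeline: each stage consumes only the previous stage's output.
    patterns = signal_analysis(channel, series, threshold)
    messages = data_interpretation(patterns)
    plan = document_planning(messages)
    return microplanning_and_realisation(plan)


if __name__ == "__main__":
    print(data_to_text("heart rate", [120, 125, 190, 130, 185, 200], threshold=180))
```

In a sketch like this, replacing the threshold detector with a learned model would only touch `signal_analysis`; the later stages would not need to change as long as the `Pattern` interface stays the same.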

In 2007, we expected machine learning to be used in signal analysis but not elsewhere. In 2022, machine learning could in principle be used in all of the above stages. However, I would strongly recommend that something like the above architecture be used; I believe that a pipeline of focused modules will do a better job than an “end-to-end” approach, especially for more complex data-to-text applications.

If you want to learn more, I suggest you read the paper; people have told me that it is relatively accessible and easy to read, even for non-specialists.

This paper currently (at the time I’m writing this blog) has around 300 citations on Google Scholar, which I suspect makes it one of the most cited ENLG/INLG papers. I like to think that most people working in (complex) data-to-text cite this paper; even if they don’t use the pipeline architecture, their systems still need to perform the above types of reasoning in some fashion.

The Babytalk project also achieved a lot of recognition (with the main publications being in journals, not conferences), and many researchers and indeed developers started using the simplenlg package (especially after we had improved it and released version 4).

I should say that in general I am proudest of my journal papers, not my conference papers. But among my conference papers, An Architecture for Data-to-Text Systems is certainly one of my favourites!

E Reiter (2007). An Architecture for Data-to-Text Systems. Proceedings of ENLG-2007, pages 97-104. URL:
