Syntactic parsing

lambeq’s string diagrams are based on a pregroup grammar, which keeps track of the types of the words in a sentence and the interactions between them. When a detailed syntactic derivation is required (as in the case of DisCoCat), a syntax tree must be provided by a statistical parser. However, since the pregroup grammar formalism is not widely known in the NLP community, there is currently no wide-coverage pregroup parser that can provide such derivations automatically. To address this problem, lambeq converts a derivation in the closest alternative grammar formalism, Combinatory Categorial Grammar (CCG), into a string diagram that faithfully encodes the syntactic structure of the sentence in a pregroup-like form [YK2021]. Because many robust CCG parsing tools are available, this allows large corpora with sentences of arbitrary length and syntactic structure to be converted into pregroup and DisCoCat form.
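To make the CCG-to-pregroup correspondence concrete, the sketch below applies the textbook mapping of categories to types: X/Y becomes X · Yˡ and X\Y becomes Yʳ · X, with adjoints reversing the order of a product. It is a simplified illustration, not lambeq's conversion code. The atomic mapping in `ATOMS`, the helper names, and the `@`-separated output notation are assumptions made for this example, and rules such as type-raising or composition are not handled.

```python
import re

# Assumed mapping from atomic CCG categories to pregroup base types.
ATOMS = {"S": "s", "NP": "n", "N": "n", "PP": "p"}


def tokenize(cat):
    # Drop features such as [dcl], then split into atoms, slashes and brackets.
    cat = re.sub(r"\[[^\]]*\]", "", cat)
    return re.findall(r"[A-Za-z]+|[/\\()]", cat)


def parse(tokens):
    """Build a nested (result, slash, argument) tuple; slashes associate left."""
    def operand(pos):
        if tokens[pos] == "(":
            node, pos = expr(pos + 1)
            return node, pos + 1  # skip the closing bracket
        return tokens[pos], pos + 1

    def expr(pos):
        node, pos = operand(pos)
        while pos < len(tokens) and tokens[pos] in ("/", "\\"):
            slash = tokens[pos]
            arg, pos = operand(pos + 1)
            node = (node, slash, arg)
        return node, pos

    return expr(0)[0]


def adjoint(simple_types, shift):
    # Adjoint of a product: reverse the order and shift every winding number.
    return [(base, z + shift) for base, z in reversed(simple_types)]


def to_pregroup(node):
    """Return a list of (base, winding) pairs; winding < 0 is left, > 0 is right."""
    if isinstance(node, str):
        return [(ATOMS[node], 0)]
    result, slash, arg = node
    res, arg_types = to_pregroup(result), to_pregroup(arg)
    if slash == "/":                       # X/Y  ->  X . Y^l
        return res + adjoint(arg_types, -1)
    return adjoint(arg_types, +1) + res    # X\Y  ->  Y^r . X


def ccg_to_pregroup(cat):
    def show(base, z):
        return base + (".r" * z if z > 0 else ".l" * -z)
    return " @ ".join(show(b, z) for b, z in to_pregroup(parse(tokenize(cat))))


print(ccg_to_pregroup("(S\\NP)/NP"))  # transitive verb
```

A transitive verb category such as (S\NP)/NP therefore becomes a type that expects a noun on its left and another on its right, which is the shape that the wires of the corresponding string diagram connect.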

Since Release 0.2.0, the standard lambeq installation includes a state-of-the-art CCG parser based on [SC2021], fully integrated into the toolkit. This parser is provided under the name Bobcat. Additionally, lambeq implements a detailed interface in the text2diagram package that allows connection to any of the many external CCG parsing tools that are currently available. For example, lambeq also ships with support for depccg [1] [YNM2017], a fast parser that comes with a convenient Python interface.

Additional external parsers can be made available to lambeq by extending the CCGParser class in order to create a wrapper subclass that encapsulates the necessary calls and translates the respective parser’s output into CCGTree format.
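The wrapper pattern described above can be sketched without lambeq installed. In the example below, `ParserBase` and `Tree` are stand-ins written for this illustration (they are not lambeq's CCGParser and CCGTree), and `toy_external_parse` is a hypothetical external parser with its own output format. A real wrapper would subclass CCGParser and build CCGTree objects instead, following the signatures given in the lambeq API documentation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Tree:
    """Stand-in for a CCG tree node (not lambeq's CCGTree)."""
    category: str
    text: str = ""
    children: list = field(default_factory=list)


class ParserBase(ABC):
    """Stand-in for the abstract parser class that a wrapper extends."""

    @abstractmethod
    def sentences2trees(self, sentences):
        """Parse a list of sentences into a list of trees."""

    def sentence2tree(self, sentence):
        # Convenience method built on the batch method.
        return self.sentences2trees([sentence])[0]


def toy_external_parse(sentence):
    """Hypothetical external parser returning nested dicts in its own format.

    It only handles two-word "subject verb" sentences.
    """
    subject, verb = sentence.split()
    return {
        "cat": "S",
        "kids": [
            {"cat": "NP", "word": subject},
            {"cat": "S\\NP", "word": verb},
        ],
    }


class ToyParserWrapper(ParserBase):
    """Encapsulates the external call and translates its output into Tree."""

    def sentences2trees(self, sentences):
        return [self._translate(toy_external_parse(s)) for s in sentences]

    def _translate(self, node):
        children = [self._translate(kid) for kid in node.get("kids", [])]
        return Tree(node["cat"], node.get("word", ""), children)


tree = ToyParserWrapper().sentence2tree("Alice runs")
print(tree.category, [child.text for child in tree.children])
```

Keeping the translation step in a single method means that the rest of the toolkit only ever sees the common tree format, whichever external parser produced the derivation.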

Finally, for users who prefer to keep the installation of the toolkit light, lambeq also includes a web-based parser class that sends parsing queries to an online API, so that local installation of a full CCG parser is no longer strictly necessary, although it is still strongly recommended for most practical uses of the toolkit.

Reading CCGBank

CCG compatibility immediately gives lambeq access to a wide range of language-related resources. For example, lambeq features a CCGBankParser class, which allows conversion of the entire CCGBank corpus [2] [HS2007] into string diagrams. CCGBank consists of 49,000 human-annotated CCG syntax trees, converted from the original Penn Treebank into CCG form. Having a gold standard corpus of string diagrams allows various supervised learning scenarios involving automatic diagram generation. Fig. 2 below shows the first tree of CCGBank’s Section 00 converted into a string diagram.

Fig. 2 The first derivation of CCGBank as a string diagram.
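For readers unfamiliar with the corpus files, the sketch below shows roughly what a CCGBank derivation looks like on disk and how the words and their categories can be pulled out of it. It assumes the bracketed AUTO layout, in which each leaf is written as `<L category POS POS word pred-arg-category>`. The sample derivation is a made-up two-word sentence, not a line from the corpus, and `read_leaves` is a hypothetical helper, not part of CCGBankParser.

```python
import re

# One leaf node: <L category modified-POS original-POS word pred-arg-category>
LEAF = re.compile(r"<L (\S+) (\S+) (\S+) (\S+) (\S+)>")


def read_leaves(derivation):
    """Return (word, CCG category) pairs for the leaves, in sentence order."""
    return [(m.group(4), m.group(1)) for m in LEAF.finditer(derivation)]


# Made-up derivation in the assumed AUTO layout (not taken from CCGBank).
sample = r"(<T S[dcl] 1 2> (<L NP NNP NNP Alice NP>) (<L S[dcl]\NP VBZ VBZ runs S[dcl]\NP_1>))"

for word, category in read_leaves(sample):
    print(word, category)
```

The internal `<T ...>` nodes record how the leaf categories combine, and it is this full tree structure, not just the leaves, that is needed to produce a string diagram.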

Footnotes