{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1. Sentence input" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Given a sentence, the first step of the process in ``lambeq`` is to convert it into a :term:`string diagram`, according to a given :term:`compositional scheme `. ``lambeq`` can accommodate any :term:`compositional model` that can encode sentences as :term:`string diagrams `, its native data structure. The toolkit currently includes a number of :term:`compositional models `, using various degrees of syntactic information: :term:`bag-of-words` models do not use any syntactic information, :term:`word-sequence models ` respect the order of words, while fully syntax-based models are based on grammatical derivations provided by a parser.\n", "\n", ":download:`Download code <../_code/sentence-input.ipynb>`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pre-processing and tokenisation" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Depending on the form of your data, some preprocessing steps may be required to make it appropriate for ``lambeq`` use. Section :ref:`sec-preprocessing` in the :ref:`NLP-101 tutorial ` provides more information about this. Here we will mainly talk about :ref:`tokenisation `, which is crucial for getting correct derivations from the :term:`Bobcat` parser. \n", "\n", "The term `tokenisation` refers to the process of breaking down a text or sentence into smaller units called `tokens`. In ``lambeq`` these tokens correspond to words, since the parser needs to know exactly which words, symbols and punctuation marks are included in the sentence in order to provide an accurate grammatical analysis.\n", "\n", "By default, the Bobcat parser assumes that the words of every sentence are delimited by whitespace, as below:\n", "\n", ".. code-block:: console\n", "\n", " \"John gave Mary a flower\"\n", " \n", "Note however that when working with raw text, this is rarely the case. Consider for example the sentence:\n", "\n", ".. code-block:: console\n", "\n", " \"This sentence isn't worth £100 (or is it?).\"\n", " \n", "A naïve tokenisation based on whitespace would result in the following list of tokens:\n", "\n", ".. code-block:: console\n", "\n", " [\"This\", \"sentence\", \"isn't\", \"worth\", \"£100\", \"(or\", \"is\", \"it?).\"]\n", " \n", "missing, for example, that \"isn't\" actually represents two words and that \"(or\" is not a proper word. \n", "\n", "In ``lambeq``, tokenisation is provided through the :py:class:`~.Tokeniser` class hierarchy, and specifically by using the :py:class:`~.SpacyTokeniser` class, based on the popular NLP package `SpaCy `_. Here is an example:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This',\n", " 'sentence',\n", " 'is',\n", " \"n't\",\n", " 'worth',\n", " '£',\n", " '100',\n", " '(',\n", " 'or',\n", " 'is',\n", " 'it',\n", " '?',\n", " ')',\n", " '.']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from lambeq import SpacyTokeniser\n", "\n", "tokeniser = SpacyTokeniser()\n", "sentence = \"This sentence isn't worth £100 (or is it?).\"\n", "tokens = tokeniser.tokenise_sentence(sentence)\n", "tokens" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "We can then pass the list of the tokens to the parser, setting the ``tokenised`` argument of the :py:meth:`~.BobcatParser.sentence2diagram` method to True." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import BobcatParser\n", "\n", "parser = BobcatParser(verbose='suppress')\n", "diagram = parser.sentence2diagram(tokens, tokenised=True)\n", "\n", "diagram.draw(figsize=(23,4), fontsize=12)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " More details about :term:`DisCoCat` and syntax-based models will follow below.\n", " \n", "To tokenise many sentences at once, use the :py:meth:`~.SpacyTokeniser.tokenise_sentences` method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['This', 'is', 'a', 'sentence', '.'],\n", " ['This', 'is', '(', 'another', ')', 'sentence', '!']]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences = [\"This is a sentence.\", \"This is (another) sentence!\"]\n", "\n", "tok_sentences = tokeniser.tokenise_sentences(sentences)\n", "tok_sentences" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Finally, ``lambeq`` provides tokenisation at the sentence-level:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['I love pizza.', 'It is my favorite food.', 'I could eat it every day!']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"I love pizza. It is my favorite food. I could eat it every day!\"\n", "sentences = tokeniser.split_sentences(text)\n", "sentences" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " To simplify the rest of this tutorial, all sentences in the following sections will be delimited by white spaces, so that the parser can tokenise them properly without extra handling." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Syntax-based model: DisCoCat" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "In order to obtain a :term:`DisCoCat`\\ -like output, we first use the :py:class:`.BobcatParser` class from the :py:mod:`~lambeq.text2diagram` package, which, in turn, calls the :term:`parser`, obtains a :term:`CCG ` derivation for the sentence, and converts it into a :term:`string diagram`. The code below uses the default :term:`Bobcat` parser in order to produce a :term:`string diagram` for the sentence \"John walks in the park\".\n", "\n", ".. note::\n", " \n", " ``lambeq``'s string diagrams are objects of the class :py:class:`lambeq.backend.grammar.Diagram`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import BobcatParser\n", "\n", "sentence = 'John walks in the park'\n", "\n", "# Parse the sentence and convert it into a string diagram\n", "parser = BobcatParser(verbose='suppress')\n", "diagram = parser.sentence2diagram(sentence)\n", "\n", "diagram.draw(figsize=(14,3), fontsize=12)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " Recall from the previous section that when the input to the :py:meth:`~.sentence2diagram` method is a list of tokens, you should also set the ``tokenised`` argument to True (it is False by default).\n", "\n", "Another case of syntax-based models in ``lambeq`` is :ref:`tree readers `, which will be presented later in this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bag-of-words: Spiders reader" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ":term:`DisCoCat` is not the only :term:`compositional model` that ``lambeq`` supports. In fact, any compositional scheme that represents sentences as :term:`string diagrams `\\ /:term:`tensor networks ` can be added to the toolkit via the readers of the :py:mod:`.text2diagram` package. For example, the :py:obj:`~lambeq.text2diagram.spiders_reader` object of the :py:class:`.LinearReader` class represents a sentence as a \":term:`bag-of-words`\", composing the words using a :term:`spider` (a commutative operation)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import spiders_reader\n", "\n", "# Create a string diagram using the spiders reader\n", "spiders_diagram = spiders_reader.sentence2diagram(sentence)\n", "\n", "# Not a pregroup diagram, we can't use grammar.draw()\n", "spiders_diagram.draw(figsize=(13,6), fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word-sequence models: Cups and stairs readers" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The :py:class:`.LinearReader` class can be used to create any kind of model where words are composed in sequence, from left to right. For example, the :py:obj:`~lambeq.text2diagram.cups_reader` instance of this class generates a \":term:`tensor train`\"." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import cups_reader\n", "\n", "# Create a string diagram using the cups reader\n", "cups_diagram = cups_reader.sentence2diagram(sentence)\n", "\n", "cups_diagram.draw(figsize=(12,2), fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the use of a `START` symbol at the beginning of the sentence, represented as an order-1 tensor (a vector). This ensures that the final result of the computation (that is, the representation of the sentence) will again be a tensor of order 1." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Another pre-made word-sequence model is provided by the :py:obj:`~lambeq.text2diagram.stairs_reader` instance. This model combines consecutive words using a box (\"cell\") in a recurrent fashion, similarly to a recurrent neural network. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import stairs_reader\n", "\n", "stairs_diagram = stairs_reader.sentence2diagram(sentence)\n", "stairs_diagram.draw(figsize=(12,5), fontsize=12)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _sec-tree-readers:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree readers" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "A :term:`CCG ` derivation follows a biclosed form [YK2021]_ , which can be directly interpreted as a series of compositions without any explicit conversion into a :term:`pregroup ` form. The :py:class:`.TreeReader` class implements a number of compositional models by taking advantage of this fact. In order to demonstrate the way they work, it is useful to first examine what a CCG diagram looks like:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Even without knowing the specifics of CCG syntax, it is not difficult to see that the verb \"gave\" is first composed with the indirect object \"Mary\", then the result is composed with the noun phrase \"a flower\", which corresponds to the direct object, and finally the entire verb phrase \"gave Mary a flower\" is composed with the subject \"John\" to return a sentence. A :py:class:`.TreeReader` follows this order of composition, as demonstrated below." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import TreeReader\n", "\n", "reader = TreeReader()\n", "sentence = \"John gave Mary a flower\"\n", "\n", "tree_diagram = reader.sentence2diagram(sentence)\n", "tree_diagram.draw(figsize=(12,5), fontsize=12)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "Note that in this default call, composition is handled by a single \"cell\" named ``UNIBOX``. This can be changed by passing an explicit argument of type :py:class:`.TreeReaderMode` to the reader's constructor. There are three possible choices:\n", "\n", "- :py:obj:`NO_TYPE` is the default, where all compositions are handled by the same ``UNIBOX`` cell (above diagram).\n", "- :py:obj:`RULE_ONLY` creates a different cell for each CCG rule.\n", "- :py:obj:`RULE_TYPE` creates a different cell for each (rule, type) pair.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lambeq import TreeReader, TreeReaderMode\n", "\n", "reader = TreeReader(mode=TreeReaderMode.RULE_ONLY)\n", "sentence = \"John gave Mary a flower\"\n", "\n", "tree_diagram = reader.sentence2diagram(sentence)\n", "tree_diagram.draw(figsize=(12,5), fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above, each unique CCG rule gets its own box: FA boxes correspond to forward application, and BA boxes to backward application. For certain tasks, making the composition box rule-specific might lead to better generalisation and overall performance." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. rubric:: See also:\n", "\n", "- :ref:`sec-preprocessing`\n", "- :ref:`lambeq.text2diagram package `\n", "- `Example notebook parser.ipynb <../examples/parser.ipynb>`_\n", "- `Example notebook reader.ipynb <../examples/reader.ipynb>`_\n", "- `Example notebook tree-reader.ipynb <../examples/tree-reader.ipynb>`_\n", "- `DisCoCat in lambeq <./discocat.ipynb>`_\n", "- `Extending lambeq <./extend-lambeq.ipynb>`_" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }