Dr. House ChatBot Design

The goal of this project is to hold a “chat” with a Dr. House language model.

See current demo here.

Status: DRAFT

This project recreates an earlier version of the same idea, which used the older, pre-Keras TF 1.0 seq2seq package from 2016. There have been significant LM improvements since then (namely, Transformers).

Objective

Language modelling is quite general, and the medium-term idea is to perform protein language modelling in some form. The main goal here is to re-familiarize myself with the latest language-modelling toolkits by building an LM-powered web app.

There are several reasons why a Dr. House chatbot makes a particularly good language-modelling project:

Data

The simplest format would be a single input, followed by a single response:

{
    "input": "I'm playing racquetball tomorrow night, with Taub.",
    "response": "Why would you hide that?"
}

However, this format discards important aspects of the back-and-forth nature of the dialogue. It could make more sense to model the data as a longer-form conversation, with a “target speaker” whose lines are modelled. Conversational text such as movie scripts tends to have more turn-taking and exchange than NMT or other “typical” seq2seq tasks. Transformer architectures in particular can query for specific pieces of information in the larger context to generate the next word, likely leading to dramatic quality improvements over RNNs. To capture these benefits, the conversation above could be given more context, in a format like the following:

{
    "target_speaks_at_turns": [0, 2, 4, 6, 8, ...],
    "conversation": [
        "Anyone sitting here?",
        "Just my persona.",
        "You know, it's amazing the way people cling on to insults. Or what they think are insults. (He takes a sandwich and fries off of Wilson’s plate and puts them on his own)",
        "So that wasn't an insult?",
        "I'm not suggesting that, like our patient, you're hiding a dark, sarcastic core beneath a candy shell of compulsive niceness. (House has pulled a fork out of his breast pocket)",
        "I'm not always nice. I'm not nice to you.",
        "Because you know nice bores me. Hence, still nice. No, I'm suggesting that you have no core. You're what whoever you're with needs you to be. Okay, I guess that could be insulting. The interesting question is why. Why do you think the world will end in chaos and destruction if you're not there to save it? (He starts eating Wilson’s lunch)",
        "Because when my parents put me in the rocket and sent me here, they said, 'James, you will grow to manhood under a yellow sun.'",
        "And why'd you lie about monster trucks?",
        "I didn’t.",
        "I checked your appointment book. You got tomorrow night marked off, but you didn't put down what you were doing. So you thought someone might look at the book —",
        "I'm playing racquetball tomorrow night, with Taub.",
        "Why would you hide that?",
        "Because the world revolves around you. I devote time to anyone else, you'd end up stalking me and harassing them.",
        "You say that as though it wouldn't be fun.",
        "And maybe I didn't want to rub your nose in the fact that we'd be doing something you can no longer do. Because I'm nice.",
    ]
}

BeautifulSoup and Requests can be used to download/parse the conversations (e.g. this tutorial).
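
For example, a minimal scraping sketch might look like the following. The URL and the <p>-tag parsing are placeholders, not the real transcript site; the actual pages will need their own selectors.

from typing import List

import requests
from bs4 import BeautifulSoup


def FetchTranscript(url: str) -> List[str]:
    """Downloads a transcript page and returns its lines of dialogue."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumes each dialogue line lives in its own <p> tag;
    # adjust the selector to match the actual page layout.
    return [p.get_text(strip=True) for p in soup.find_all('p')]


lines = FetchTranscript('https://example.com/house-transcripts/s01e01')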

Modelling

Model types: Two powerhouse architectures for language modelling are the LSTM and the Transformer. Because the LSTM supports streaming, it can be more efficient at inference; however, this efficiency comes at the cost of accuracy. Transformers use self-attention to explicitly query for long-range dependencies in the input, which can dramatically improve modelling quality. An encoder-decoder architecture, in which the decoder generates a full response from a fixed encoding of the input context, can further improve performance.
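
As a concrete starting point, a minimal decoder-only LSTM LM in Keras could look like the sketch below. The vocabulary size and layer dimensions are placeholders, not settings from this project.

from tensorflow import keras

VOCAB_SIZE = 10_000   # placeholder
EMBED_DIM = 128       # placeholder
LSTM_UNITS = 256      # placeholder

inputs = keras.Input(shape=(None,), dtype='int32')
x = keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
x = keras.layers.LSTM(LSTM_UNITS, return_sequences=True)(x)
# Per-position logits over the vocabulary (softmax is applied in the loss).
logits = keras.layers.Dense(VOCAB_SIZE)(x)
lm = keras.Model(inputs, logits)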

The loss function can be the cross entropy of the “next” token, $-\log P(y_t \mid y_{1 \dots t - 1})$. At decoding time, either sampling from this distribution or taking its $\arg\max$ can be used.

The loss will be masked to apply only to tokens where the target speaker (Dr. House) talks. A special <EOS> token will demarcate speaker changes. At inference time, the model will receive the prior conversation as input, decode until an <EOS> token is emitted, and then pass the turn back to the user. Modelling a stream of text with a turn-taking <EOS> token also has some additional small benefits.
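
For instance, the conversation format above could be flattened into a single token stream with a per-token loss mask, roughly as in this sketch (whole-word splitting stands in for real tokenization):

from typing import Dict, List, Tuple

EOS = '<EOS>'


def FlattenConversation(example: Dict) -> Tuple[List[str], List[bool]]:
    """Returns (tokens, mask); the mask is True only where Dr. House speaks."""
    target_turns = set(example['target_speaks_at_turns'])
    tokens, mask = [], []
    for turn_index, utterance in enumerate(example['conversation']):
        turn_tokens = utterance.split() + [EOS]  # whole-word split, for illustration
        tokens.extend(turn_tokens)
        mask.extend([turn_index in target_turns] * len(turn_tokens))
    return tokens, mask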

Experimental notes follow: the downside to masking was implementing a tf.function in a custom training loop, rather than using the Keras built-in training loop (compile and then fit). The custom training loop provides flexibility at the cost of verbosity. The Keras class_weight feature could provide some improvements here.
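
A sketch of what the masked loss inside such a custom tf.function train step might look like (this assumes the model emits per-position logits, as in the LSTM sketch above; it is not the project's actual training code):

import tensorflow as tf


@tf.function
def TrainStep(model, optimizer, token_ids, next_token_ids, loss_mask):
    """token_ids, next_token_ids: (batch, seq) int; loss_mask: (batch, seq) float."""
    with tf.GradientTape() as tape:
        logits = model(token_ids, training=True)  # (batch, seq, vocab)
        per_token_loss = tf.keras.losses.sparse_categorical_crossentropy(
            next_token_ids, logits, from_logits=True)  # (batch, seq)
        # Only the tokens spoken by Dr. House contribute to the loss.
        loss = (tf.reduce_sum(per_token_loss * loss_mask) /
                tf.maximum(tf.reduce_sum(loss_mask), 1.0))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss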

The ml-collections library is helpful for creating training configurations.
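
For example, a training config might be defined as follows (the field names and values are illustrative, not the project's actual configuration):

import ml_collections

config = ml_collections.ConfigDict()
config.vocab_size = 10_000
config.embedding_dim = 128
config.lstm_units = 256
config.learning_rate = 1e-3
config.batch_size = 32
config.num_epochs = 20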

Training and testing configurations were split on a per-episode level, rather than a per-example level. This split ensures that the test set remains relatively novel.
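
A possible implementation of such a split, assuming each example carries a hypothetical 'episode' identifier:

import random
from typing import Dict, List, Tuple


def SplitByEpisode(examples: List[Dict], test_fraction: float = 0.1,
                   seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Holds out a random subset of episodes; all of their examples go to test."""
    episodes = sorted({example['episode'] for example in examples})
    random.Random(seed).shuffle(episodes)
    num_test = max(1, int(len(episodes) * test_fraction))
    test_episodes = set(episodes[:num_test])
    train = [ex for ex in examples if ex['episode'] not in test_episodes]
    test = [ex for ex in examples if ex['episode'] in test_episodes]
    return train, test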

A reference for different gradient descent optimizers is here.

Inference

The model should run in the browser, so that it can be served from this website. While client-side inference is more difficult to configure, removing any dependency on a separate web server will be more sustainable in the long term. Fewer moving parts will enable faster iteration.

Another alternative would be AWS Lambda, which removes the overhead of running a separate service. It is likely that this approach would be viable. If the local load time is too high, then the model can be moved to a Lambda function relatively easily to provide a good user experience with low-latency loading. If the local load time is not too high, then the simplicity of on-device inference is preferred.

The client will need to load a vocabulary file and do some basic tokenization to pre-process the input text. For this reason, it can be beneficial to use simple tokenization (either whole-word or char-level), rather than relying on a large WordPiece model to tokenize the text.
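
As a reference for what the client would mirror, a simple whole-word tokenizer might look like this (the vocabulary-file format and the <UNK> token are assumptions):

from typing import Dict, List


def LoadVocab(path: str) -> Dict[str, int]:
    """Assumes one token per line; the line number becomes the token id."""
    with open(path) as f:
        return {line.strip(): index for index, line in enumerate(f)}


def Tokenize(text: str, vocab: Dict[str, int], unk_token: str = '<UNK>') -> List[int]:
    return [vocab.get(word, vocab[unk_token]) for word in text.lower().split()]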

In terms of JavaScript neural net runtimes, I see two primary options (one for TensorFlow, and one for PyTorch).

(1) TensorFlow.js has mature documentation.

(2) ONNX documentation is also pretty good.

Conversion

The architecture of a language model differs between training and inference: during training, the full target sequence is available and fed in at once (teacher forcing), while at inference the model must generate one token at a time, feeding each output back in as the next input.

The LSTM char-seq2seq Keras tutorial hints at this issue in the inference section: the code creates a separate copy of the model for inference (via copy-paste-modify).

To reuse the same model in either mode, the keyword-spotting streaming library (github) provides some inspiration. In particular, the stateful model $LSTM(x_t)$ is really a stateless function of its input and prior state, $LSTM(x_t, S_{t-1})$. The inference loop can track that state externally, as in the following code snippet:

from typing import List

import numpy as np
from tensorflow import keras

# EOS_TOKEN_ID is assumed to be defined alongside the vocabulary.

def Decode(lstm: keras.Model, input_token_ids: List[int],
           eos_token_id: int = EOS_TOKEN_ID,
           max_decoder_len: int = 20) -> List[int]:
    # Initialize the model state from the inputs.
    state = lstm.zero_states()
    softmax, *state = lstm([input_token_ids, state])

    # The decode function is a greedy argmax
    # (could also implement beam search).
    decode = lambda softmax: int(np.argmax(softmax))

    # Decode the rest of the sequence.
    # Keep predicting the next token,
    # until the model emits end-of-sentence.
    # The max decoder length ensures termination,
    # even in the case of model error.
    next_token = decode(softmax)
    decoder_outputs = [next_token]
    while (next_token != eos_token_id and
           len(decoder_outputs) < max_decoder_len):
        # Run the model for another timestep,
        # with the prior output and prior state as input.
        softmax, *state = lstm([[next_token], state])

        # Decode the next token, and store it.
        next_token = decode(softmax)
        decoder_outputs.append(next_token)

    # Finished.
    return decoder_outputs

To achieve the desired inference behavior, try:

training_model = LoadModel(training_dir)
inference_model = CreateModel(GetConfig(training_dir), mode='inference')
for training_layer, inference_layer in zip(training_model.layers,
                                           inference_model.layers):
    inference_layer.set_weights(training_layer.get_weights())

Weight Tying

Weight tying is helpful for small LMs: when the vocabulary is relatively large (e.g. 10k tokens) and the LSTM LM is small, most of the parameters can end up in the embedding layer.

A note on the SavedModel format and weight tying: the SavedModel format traces the Keras model and, in calling each layer, records the operations. This means that a simple operation like weight tying does not duplicate the variables it reuses, provided you do not save them as local variables. Below is a simple implementation of a custom Keras Layer that emits logits (an Activation layer can add the softmax on top).

import tensorflow as tf
from tensorflow import keras


class WeightTiedLayer(keras.layers.Layer):
    '''Implements weight tying with the embedding layer.

    Usage:
      logits = WeightTiedLayer(embedding)(outputs)
    '''

    def __init__(self, embedding_layer=None, **kwargs):
        super(WeightTiedLayer, self).__init__(**kwargs)
        self.embedding_layer = embedding_layer

    def build(self, input_shape):
        # One bias term per vocabulary entry.
        self.bias = self.add_weight(
            shape=(self.embedding_layer.input_dim,),
            name='bias',
            initializer='random_normal',
            trainable=True)
        super(WeightTiedLayer, self).build(input_shape)

    def call(self, inputs):
        # Shapes:
        # - inputs: (batch, seq, N)
        # - embedding_weights: (vocab_size, N)
        # - embedding_weights.T: (N, vocab_size)
        embedding_weights = self.embedding_layer.weights[0]
        output = inputs @ tf.transpose(embedding_weights)
        return output + self.bias

UI

The UI can be quite simple; the main challenge will be to design the integration.