Building a Next Word Prediction Model in Python

Introduction: Next word prediction is a common natural language processing (NLP) task: given a sequence of words, predict the word most likely to follow it. In this tutorial, we'll build a next word prediction model in Python using Word2Vec embeddings and an LSTM network, then train it and use it to predict words, with code examples for each step.

Dataset Preparation: For this tutorial, we'll use a sample text dataset to train our next word prediction model. You can either use your own text data or download a publicly available dataset. Let's assume we have a dataset named "sample_text.txt" containing a sequence of sentences.

Step 1: Data Preprocessing: The first step is to preprocess the text data by tokenizing the words and creating sequences of input-output pairs for training the model. Here's how you can do it in Python:

# Load the text data
with open('sample_text.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()

# Tokenize words
words = text_data.split()

# Create input-output sequences
input_sequences = []
output_words = []
sequence_length = 5  # Define the sequence length

for i in range(len(words) - sequence_length):
    input_sequence = words[i:i + sequence_length]
    output_word = words[i + sequence_length]
    input_sequences.append(input_sequence)
    output_words.append(output_word)
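
To make the sliding window concrete, here is a minimal sketch that applies the same loop to a toy sentence. The helper name `make_training_pairs` and the toy sentence are illustrative assumptions, not part of the tutorial's dataset.

```python
def make_training_pairs(tokens, sequence_length):
    """Return (input_window, next_word) pairs from a list of tokens."""
    pairs = []
    for start in range(len(tokens) - sequence_length):
        window = tokens[start:start + sequence_length]  # the input words
        target = tokens[start + sequence_length]        # the word that follows them
        pairs.append((window, target))
    return pairs


# Hypothetical toy corpus, used only for illustration
toy_tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = make_training_pairs(toy_tokens, sequence_length=3)

for window, target in pairs:
    print(window, "->", target)
```

With nine tokens and a window of three, the loop yields six pairs. A text with `sequence_length` or fewer words yields none, so it is worth checking the length of the result before training.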

Step 2: Feature Engineering: Next, we need to convert the input sequences into numerical vectors using techniques like one-hot encoding or word embeddings. We'll use the Word2Vec model for word embeddings in this tutorial:

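from gensim.models import Word2Vec

# Train a skip-gram (sg=1) Word2Vec model on the input windows.
# The final window is appended so that the last word of the text, which
# otherwise appears only as an output word, is also in the vocabulary.
training_sentences = input_sequences + [words[-sequence_length:]]
word2vec_model = Word2Vec(sentences=training_sentences, vector_size=100, window=5, min_count=1, sg=1)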

# Get the word embeddings
word_embeddings = word2vec_model.wv
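
For comparison, the one-hot alternative mentioned above can be sketched in plain Python. The helper names `build_vocabulary` and `one_hot` and the toy sentence are assumptions made for illustration. Each one-hot vector is as long as the vocabulary, whereas the Word2Vec vectors used in this tutorial keep a fixed size of 100 regardless of vocabulary size.

```python
def build_vocabulary(tokens):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


def one_hot(token, vocabulary):
    """Return a list of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[token]] = 1
    return vector


# Hypothetical toy corpus, used only for illustration
toy_tokens = "the cat sat on the mat".split()
vocabulary = build_vocabulary(toy_tokens)

print(vocabulary)
print(one_hot("sat", vocabulary))
```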

Step 3: Model Creation: Now, we'll create a next word prediction model using a simple neural network architecture, built with Keras on the TensorFlow backend. Because each input word is fed to the network as its Word2Vec vector, the LSTM reads those vectors directly and no separate Embedding layer is needed:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

vocab_size = len(word_embeddings.index_to_key)

# Define the model architecture
model = Sequential([
    # Each sample is a sequence of sequence_length Word2Vec vectors
    Input(shape=(sequence_length, word_embeddings.vector_size)),
    LSTM(100),
    # One output probability per vocabulary word
    Dense(vocab_size, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
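
To show what the softmax output and the categorical cross-entropy loss compute, here is a small pure-Python sketch for a single example. The three scores and the target vector are made-up values for a hypothetical three-word vocabulary; Keras performs the same arithmetic in batches over the full vocabulary.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]  # shift for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the correct class."""
    return -sum(
        target * math.log(probability)
        for target, probability in zip(one_hot_target, probabilities)
        if target > 0
    )


# Hypothetical scores for a three-word vocabulary; the first word is the correct one
scores = [2.0, 1.0, 0.1]
target = [1, 0, 0]

probabilities = softmax(scores)
loss = categorical_crossentropy(target, probabilities)

print(probabilities)
print(loss)
```

The loss shrinks toward zero as the probability assigned to the correct next word approaches 1, which is what training pushes the model toward.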

Step 4: Model Training: Train the model using the input-output sequences and corresponding word embeddings:

import numpy as np
from tensorflow.keras.utils import to_categorical

# Convert each input sequence to its word vectors; X has shape (samples, sequence_length, 100)
X = np.array([word_embeddings[sequence] for sequence in input_sequences])
# One-hot encode the vocabulary index of each output word (gensim 4 uses key_to_index)
y = to_categorical([word_embeddings.key_to_index[word] for word in output_words], num_classes=len(word_embeddings.index_to_key))

# Train the model
model.fit(X, y, epochs=100, batch_size=64)
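
The training call above reports accuracy only on the data the model has already seen. One way to get a rough sense of generalization is to hold out the last portion of the input-output pairs, sketched below in plain Python. The helper name `split_pairs` and the toy data are assumptions for illustration. The split is positional rather than shuffled, because neighbouring windows overlap and shuffling would place near-duplicates of training samples in the held-out set.

```python
def split_pairs(inputs, targets, validation_fraction=0.1):
    """Hold out the last fraction of the pairs for validation."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    validation_size = int(len(inputs) * validation_fraction)
    split_index = len(inputs) - validation_size
    training = (inputs[:split_index], targets[:split_index])
    validation = (inputs[split_index:], targets[split_index:])
    return training, validation


# Hypothetical toy pairs, used only for illustration
toy_inputs = [[i, i + 1] for i in range(8)]
toy_targets = [i + 2 for i in range(8)]

(train_x, train_y), (val_x, val_y) = split_pairs(toy_inputs, toy_targets, validation_fraction=0.25)
print(len(train_x), len(val_x))
```

Slicing works the same way on NumPy arrays, so the held-out part of `X` and `y` could be passed to Keras through the `validation_data` argument of `model.fit`.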

Step 5: Next Word Prediction: Finally, we can use the trained model to predict the next word given a sequence of words:

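# Function to predict the next word
def predict_next_word(input_sequence):
    # Look up the Word2Vec vector of each word; every word must be in the training vocabulary
    input_vector = np.array([word_embeddings[word] for word in input_sequence])
    # Add a batch dimension: (1, number of words, embedding size)
    probabilities = model.predict(input_vector.reshape(1, -1, word_embeddings.vector_size))[0]
    predicted_index = np.argmax(probabilities)
    predicted_word = word_embeddings.index_to_key[predicted_index]
    return predicted_word

# Example usage: pass as many words as the model was trained on (sequence_length = 5)
input_sequence = ['the', 'quick', 'brown', 'fox', 'jumps']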
next_word = predict_next_word(input_sequence)
print("Next Word Prediction:", next_word)
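
Taking only the single most likely word discards the rest of the model's output. As a sketch, the helper below ranks a probability vector and returns the top few candidates instead. The name `top_k_words` and the toy values are illustrative assumptions; in this tutorial the probabilities would come from `model.predict(...)[0]` and the word list from `word_embeddings.index_to_key`.

```python
def top_k_words(probabilities, index_to_word, k=3):
    """Return the k most probable words with their probabilities, best first."""
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(index_to_word[i], probabilities[i]) for i in ranked[:k]]


# Hypothetical vocabulary and model output, used only for illustration
toy_vocabulary = ["jumps", "runs", "sleeps", "the"]
toy_probabilities = [0.55, 0.25, 0.15, 0.05]

candidates = top_k_words(toy_probabilities, toy_vocabulary, k=2)
print(candidates)
```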

Conclusion: In this tutorial, we've covered the entire process of building a next word prediction model in Python using Word2Vec embeddings and LSTM neural networks. You can further enhance the model by experimenting with different architectures, hyperparameters, and training on larger datasets for improved predictions in real-world applications.