Bidirectional English — Marathi Language Translation Model

Anuj Sahani
Sep 4, 2023


Fine-tuning the pretrained mBART50 model to make an en-mr bidirectional translation model using Hugging Face transformers.

By: @sahaniiianuj

mbart | mbart25 | mbart50

BART is a seq2seq architecture consisting of 12 encoder layers, 12 decoder layers, and an additional normalization layer for stabilizing the model.

mBART25 is a BART model pre-trained on monolingual corpora from 25 different languages. Pre-training on monolingual text uses denoising autoencoding: the text is corrupted (for example, by masking spans of tokens) and the model learns to reconstruct the original sentence. Masked monolingual data in the target language can also be used to create synthetic parallel data, which can then be used for fine-tuning. The mBART25 model can be used for different downstream tasks; the one we are going to deal with here is language translation.

There can be 2 types of translation:

  • Supervised Translation: when bitext (parallel text in language 1 and language 2) is available for the source-target language pair.
  • Unsupervised Translation: there are 2 cases for unsupervised translation:
    1) When no bitext of any kind is available.
    2) When bitext is not available for the required language pair but is available for other language pairs.
    For unsupervised translation, 2 techniques are used: Back Translation and Language Transfer.
    Back Translation: the target-language text is first translated into another language and then translated back into the target language. This helps the model learn how to translate into the target language without any earlier bilingual fine-tuning for that specific language (a minimal sketch of this idea follows the list).
    Language Transfer: this allows the model to generalize its translation abilities to a wide range of languages, even if it is not explicitly fine-tuned for a particular language pair.
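
To make the back-translation idea concrete, here is a minimal, illustrative sketch that is not part of the fine-tuning pipeline in this post: it uses the off-the-shelf facebook/mbart-large-50-many-to-many-mmt checkpoint to turn monolingual Marathi into synthetic English, producing synthetic parallel pairs. The example sentence is my own.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# illustrative checkpoint: an off-the-shelf multilingual translation model
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="mr_IN")

marathi_sentences = ["माझे नाव अनुज आहे."]  # monolingual Marathi text
inputs = tokenizer(marathi_sentences, return_tensors="pt", padding=True)
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
synthetic_english = tokenizer.batch_decode(generated, skip_special_tokens=True)
# each (synthetic_english[i], marathi_sentences[i]) pair is a synthetic en-mr training example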

mBART50 is an extended version of mBART25. The model was not trained from scratch; instead, it was initialized from mBART25 and further pre-trained on additional monolingual data, following the same denoising technique, to cover 50 languages. This model performs better than the original mBART25 model.

Bilingual Fine-tuning | Multilingual Fine-tuning

In bilingual fine-tuning, the pretrained mBART25 model is trained on a single language pair (bitext).

The dataset looks something like: {en-hi}, {mr-en}, {fr-en}, etc.

Bilingual fine-tuning gives worse results than multilingual fine-tuning. So, instead of training a model from language A to language B, the model is trained from N languages to N different languages. This helps create a single model that can translate between many different languages.

The dataset looks something like: {(en-mr), (fr-de), …}

There are 3 types of models that can be obtained by multilingual fine-tuning:

  • one to many
  • many to one
  • many to many

The one-to-many model gives worse results than the other two variants; one possible reason is that the model has to decode into several different languages at the same time.

This was a brief introduction to mBART model, intended to familiarize you with its architecture and functioning.

Fine-tuning the pretrained mBART50 model to make an en-mr bidirectional translation model using 🤗 transformers.

HuggingFace:

Hugging Face is a leading AI research organization and technology company that specializes in Natural Language Processing (NLP). They are known for their contributions to open-source NLP libraries and frameworks, including the Transformers library. A detailed guide on Hugging Face transformers can be found here. Their user-friendly tools and APIs enable developers to integrate state-of-the-art NLP capabilities into their applications effortlessly, making complex language tasks more accessible and accelerating innovation in the field of NLP.
You may check it out here.

Dataset Preparation:

I collected English-Marathi bitext pairs from different open-source resources and combined them into a single dataset dictionary using the Hugging Face datasets library.

This unclean dataset can be accessed from here.
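
As a rough idea of how such a combination can be done (the actual source files and cleaning steps live in the GitHub repo; the file names below are placeholders), the bitext sources can be merged into one DatasetDict like this:

from datasets import Dataset, DatasetDict, concatenate_datasets
import pandas as pd

# each source is assumed to be a two-column (english, marathi) file; names are placeholders
frames = [pd.read_csv(path) for path in ["source_a.csv", "source_b.csv"]]
parts = [Dataset.from_pandas(df[["english", "marathi"]]) for df in frames]

# merge all sources and carve out a held-out test split
combined = concatenate_datasets(parts)
splits = combined.train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({"train": splits["train"], "test": splits["test"]})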

Dataset Preprocessing:

Dataset processing is crucial in machine learning and data analysis as it directly impacts the quality and efficiency of models. Properly processed datasets, including cleaning, normalization, feature engineering, and handling missing values, ensure accurate, reliable, and relevant information for model training and evaluation. High-quality datasets reduce noise, prevent biased outcomes, and enable models to capture essential patterns and relationships in the data, improving the model’s generalization, robustness, and ability to extract meaningful insights. The following steps were followed for dataset preparation:

  • Dropped all the empty and duplicate rows present in the dataset.
  • Used Facebook’s FastText for language detection and filtered out sentences in any other languages present in the dataset (a minimal sketch of this filtering step follows the list).
  • To address the issue of variable-length sequences in the dataset and mitigate noise, a segmentation technique was employed to partition the dataset into four distinct sections based on the lengths of the input sentences. This created more homogeneous subsets of data, each containing sequences of similar lengths, which reduced the variability in sequence lengths and left more structured and manageable subsets for further analysis or modeling.
    Subsequently, the sentences within each segmented section were merged together using a specific pattern.
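
A minimal sketch of the FastText-based language filter is shown below; the lid.176.bin model path, the 0.5 confidence threshold, and the column names are assumptions for illustration, not the exact settings used.

import fasttext

# assumes the fastText language-identification model has been downloaded locally
lid_model = fasttext.load_model("lid.176.bin")

def is_language(text, lang_code):
    # predict returns ((labels,), (probabilities,)) for a single input string
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == f"__label__{lang_code}" and probs[0] > 0.5

def keep_pair(example):
    # keep a row only if both sides are detected as the expected language
    return is_language(example["english"], "en") and is_language(example["marathi"], "mr")

dataset = dataset.filter(keep_pair)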

Take a look at my GitHub repo for better and detailed understanding.

The dataset can be accessed easily using the Hugging Face datasets library with the following code:

from datasets import load_dataset
dataset = load_dataset("anujsahani01/processed_en_mr")

Using Hugging Face Data Collator:

Data collators are objects that form a batch from a list of dataset elements. These elements are of the same type as the elements of train_dataset or test_dataset (eval_dataset).

To be able to build batches, data collators may apply some processing (like padding). Some of them also apply random data augmentation (like random masking) on the formed batch. In this model we have used DataCollatorForSeq2Seq, because tasks such as translation and question answering use a sequence-to-sequence architecture. It can be used in the following manner:

from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
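
As a quick, illustrative check of what the collator does (the token IDs below are made up), it pads input_ids with the tokenizer’s pad token and labels with -100, so padded label positions are ignored by the loss:

# illustrative only: two already-tokenized examples of unequal length
features = [
    {"input_ids": [9, 120, 2], "labels": [77, 5020, 81, 2]},
    {"input_ids": [9, 712, 88, 301, 2], "labels": [77, 81, 2]},
]
batch = data_collator(features)
# input_ids are padded to the longest source, labels to the longest target
print(batch["input_ids"].shape, batch["labels"].shape)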

Model: We are going to fine-tune the mBART50 model on my own en-mr dataset using Hugging Face.

Text data cannot be fed directly into the model; we need a tokenizer to convert the text into token IDs. Hugging Face transformers is a library that contains tokenizers for all of its models. Most tokenizers are available in two forms: a full Python implementation and a “Fast” implementation based on the Rust library 🤗 Tokenizers. Facebook’s mBART50 tokenizer can be used with the help of the following code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-50",
    src_lang="en_XX",
    tgt_lang="mr_IN",
    model_max_length=512,
)
tokenizer.bos_token = tokenizer.eos_token

max_length = 512  # matches model_max_length set above

# tokenizing the dataset
def tokenize(data):
    data = tokenizer(
        text=data['marathi'],
        text_target=data['english'],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_length=True,
        return_tensors='pt',
    )
    # collect the encoded inputs and targets for each example in the batch
    inp_ids = []
    if len(data['input_ids']) != 0:
        for length, ids in zip(data["length"], data["input_ids"]):
            inp_ids.append(ids)
    label = []
    for length, lb in zip(data["length"], data["labels"]):
        label.append(lb)
    return {"input_ids": inp_ids, "labels": label}

tokenized_dataset = dataset.map(
    tokenize,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

The following parameters were passed to the 🤗 Tokenizer:

  • text — The sequence or batch of sequences to encode. If the sequences are provided as lists of strings (pre-tokenized), you must set is_split_into_words=True.
  • text_target (optional) — The sequence or batch of sequences to encode as target texts. If the sequences are provided as lists of strings (pre-tokenized), you must set is_split_into_words=True.
  • max_length (optional) — Controls the maximum length to use by one of the truncation/padding parameters.
  • padding (optional) — ‘max_length’: pads to the length specified by the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided.
  • truncation — Truncates to the length specified by the max_length argument, or to the maximum acceptable input length for the model if this argument is not specified. This truncates token by token, removing a token from the longest sequence in the pair when a sequence pair (or batch of pairs) is provided.
  • return_length (optional) — Whether or not to return the lengths of the encoded inputs.
  • return_tensors (optional) — If set, returns tensors instead of lists of Python integers. This parameter accepts the following values:
    ‘tf’ : Return TensorFlow objects.
    ‘pt’ : Return PyTorch objects.
    ‘np’ : Return NumPy objects.
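
As a quick sanity check, you can tokenize a single illustrative pair and decode it back (the example sentences below are my own, not taken from the dataset):

# encode one English/Marathi pair and round-trip it through the tokenizer
sample = tokenizer(
    text="My name is Anuj.",
    text_target="माझे नाव अनुज आहे.",
    max_length=32,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(sample["input_ids"].shape)  # torch.Size([1, 32]) after padding to max_length
print(tokenizer.decode(sample["input_ids"][0], skip_special_tokens=True))
print(tokenizer.decode(sample["labels"][0], skip_special_tokens=True))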

Loading mbart50 pre-trained model using Hugging Face ‘from_pretrained’ and AutoClass

Hugging Face’s AutoClass is a powerful feature that facilitates seamless and checkpoint-agnostic model loading. It automatically infers and selects the appropriate architecture based on the provided checkpoint, eliminating the need for manual identification and selection. By using the from_pretrained() method, users can quickly access pretrained models for various architectures without having to train models from scratch, saving time and computational resources.

This checkpoint-agnostic capability ensures that code designed to work with one checkpoint will function with another, given they are trained for similar tasks, even if the architectures differ. The from_pretrained() method can deduce the desired architecture from the supplied name or path of the pretrained model, further streamlining the loading process.

For instance, AutoModelForSeq2SeqLM allows users to manually specify the AutoClass for sequence-to-sequence language modeling tasks. This versatility and automation make it easy to retrieve relevant models using pretrained weights, configuration, or vocabulary, providing a more efficient and user-friendly experience for developers working with a wide range of natural language processing tasks. The pre-trained mbart model can be loaded using the following code:

from transformers import AutoModelForSeq2SeqLM, AutoConfig

model_checkpoint = "facebook/mbart-large-50"

# load the checkpoint's configuration, keeping the vocabulary size in sync with the tokenizer
config = AutoConfig.from_pretrained(model_checkpoint, vocab_size=len(tokenizer), use_auth_token=True)
config.vocab_size = tokenizer.vocab_size

# load the pre-trained weights with this config and move the model to the GPU
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    config=config,
    ignore_mismatched_sizes=True,
).to('cuda')

# make sure the embedding matrix matches the tokenizer's full vocabulary
model.resize_token_embeddings(len(tokenizer))

As of now, we are done with data preprocessing, tokenization, and loading the pre-trained model; the only step left is model training. This can be done easily using the Hugging Face Trainer and Training Arguments.

Hugging Face Trainer and Training Arguments:

Hugging Face Trainer and Training Arguments are powerful tools designed to streamline the training process of Deep learning models. The Trainer is a high-level API that simplifies the training loop, handling the complexities of training, evaluation, and model saving. It allows users to train models with just a few lines of code, abstracting away many implementation details.

Training Arguments is a data class provided by the Transformers library (by Hugging Face), and it serves as a container for all the hyperparameters and configurations needed for training. Users can customize the training process by setting various arguments such as batch size, learning rate, number of epochs, and more. This allows for easy experimentation and fine-tuning of the training process. Together, the Trainer and Training Arguments provide a user-friendly and efficient way to train various natural language processing models, including text classification, question-answering, language generation, and more. It can be used with the following code:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# `params` is a dictionary of hyperparameter values and `compute_metrics` is the
# evaluation function; both are defined in the accompanying notebook (see the GitHub repo)
args = Seq2SeqTrainingArguments(
    "./finetuned_en-mr",
    evaluation_strategy=params['evaluation_strategy'],
    save_strategy="epoch",
    learning_rate=params['lr'],
    per_device_train_batch_size=params['per_device_train_batch_size'],
    per_device_eval_batch_size=params['per_device_eval_batch_size'],
    warmup_steps=params['warmup_steps'],
    weight_decay=params['weight_decay'],
    save_total_limit=params['save_total_limit'],
    num_train_epochs=params['num_train_epochs'],
    predict_with_generate=params['predict_with_generate'],
    fp16=params['fp16'],
    push_to_hub=params['push_to_hub'],
    max_steps=params['max_steps'],
    remove_unused_columns=params['remove_unused_columns'],
)

trainer = Seq2SeqTrainer(
    model.to('cuda'),
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
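
The compute_metrics function passed to the trainer above is defined in the accompanying notebook; one common way to implement it for translation, shown here only as a sketch, is sacreBLEU via the evaluate library (requires the evaluate and sacrebleu packages):

import numpy as np
import evaluate

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # replace the -100 padding used for the loss before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(
        predictions=[p.strip() for p in decoded_preds],
        references=[[l.strip()] for l in decoded_labels],
    )
    return {"bleu": result["score"]}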

After training the model for English to Marathi translation, we can further fine-tune it for Marathi to English translation using a technique called bidirectional training. In bidirectional training, the model is modified by making small adjustments to the input and target text during tokenization, effectively interchanging the languages to facilitate translation in both directions.

There are two straightforward methods to achieve this:

  • The first method involves continuing the training process for Marathi to English translation once the model is initially trained for English to Marathi translation. This way, the model learns the semantics and syntactic structures of both languages, making it bidirectional.
  • The second method involves pushing the fine-tuned model to the Hugging Face Hub, which allows easy versioning and sharing. This method is useful when we do not have enough GPU RAM to continue training in the same Google Colab session. By pushing the model to the Hub, we can conveniently reload the fine-tuned checkpoint later for further training:
trainer.push_to_hub()
  • The pre-trained model can then be fine-tuned again with the inputs and target text interchanged, ensuring it learns to translate in both directions effectively (a minimal sketch of this reversed setup follows the list).
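
As referenced in the last point, here is a minimal sketch of the reversed (Marathi to English) pass; the Hub repo id below is hypothetical and stands in for wherever the first-pass checkpoint was pushed:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "anujsahani01/finetuned_en-mr"  # hypothetical repo id for the first-pass model
# the source and target language codes are swapped relative to the first pass
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="mr_IN", tgt_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to("cuda")

def tokenize_reverse(data):
    # Marathi sentences become the inputs and English sentences the targets
    return tokenizer(
        text=data["marathi"],
        text_target=data["english"],
        max_length=512,
        padding="max_length",
        truncation=True,
    )

tokenized_reverse = dataset.map(tokenize_reverse, batched=True,
                                remove_columns=dataset["train"].column_names)
# re-create the Seq2SeqTrainer with tokenized_reverse and call trainer.train() as before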

By employing these techniques, a single model can serve as a powerful translation tool for both English to Marathi and Marathi to English, providing bidirectional translation capabilities. This approach optimizes model reusability and reduces the need for separate training pipelines, ultimately enhancing translation performance and efficiency.

Conclusion:

In conclusion, we successfully fine-tuned the powerful mBART50 model for bidirectional English to Marathi and Marathi to English translation using Hugging Face transformers. Leveraging data preprocessing, tokenization, and advanced training techniques, we created a single model capable of translating between these languages in both directions. This bidirectional translation model eliminates the need for separate training, offering enhanced efficiency and reusability. With Hugging Face’s Trainer and Training Arguments, we optimized the training process, simplifying hyperparameter tuning and accelerating model development. This bidirectional translation model has great potential for multilingual applications, facilitating seamless communication and understanding across language boundaries.

GitHub: EngMarBART

Feedback:

If you have any feedback, please reach out to me at: LinkedIn
