ArabicNLP 2026 Shared Task

AlexandriaX-2026

Context-Aware Dialectal Arabic MT and MT Evaluation

A shared task for building and diagnosing dialectal Arabic MT systems. Participants can build systems that use rich dialogue context, speaker metadata, persona attributes, and multiple dialectal varieties. They can also build systems that diagnose dialectal Arabic translations by identifying error spans and assigning an error type to each span.

Registration

Registration is open for AlexandriaX-2026.

Register by July 20, 2026, and join the Google Group for task announcements, data releases, and evaluation updates.

Motivation

Dialectal Arabic MT needs context, diagnosis, and broader coverage.

Arabic dialectal machine translation remains difficult because of regional linguistic variation, limited high-quality parallel data, non-standardized orthography, and translation choices that depend on social and conversational context.

The AlexandriaX-2026 shared task evaluates systems that preserve meaning while adapting lexical, morphological, pragmatic, and sociolinguistic choices to the requested Arabic dialect. It also covers MT evaluation with interpretable error detection, so teams can see not only which systems score well but also where and why they fail.

Dialogue-aware translation

Source turns are provided with previous context, domain information, and speaker-addressee gender metadata.

Broad dialectal coverage

The MT subtasks span broad Arabic varieties instead of treating dialectal Arabic as one label.

Cross-dialect MT

Short financial-domain queries are translated from one Arabic variety into another.

Diagnostic MT evaluation

Span-level LQM-inspired annotations support error detection and classification.

Shared Task Subtasks

Three complementary routes into dialectal Arabic MT.

Participants can build context-aware translation systems, cross-dialect Arabic MT systems, diagnostic error-analysis systems, or any combination of the three.

1

Context-Aware English-to-Dialectal Arabic Dialogue Translation

Constrained: provided data only, <=5B parameters · Unconstrained: external resources allowed, no restrictions on model usage

Objective

Participants need to build a machine translation system that translates selected English dialogue turns into a specified dialectal Arabic variety. The system must use the provided context, such as previous dialogue turns, target country/dialect label, domain, speaker/addressee gender, and persona information when available. The output should preserve the original meaning while sounding natural and appropriate in the target dialect, including correct lexical, morphological, pragmatic, and sociolinguistic choices.
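As a rough illustration of how the context listed above might be packaged for a model, the sketch below gathers it into one record and flattens it into a text prompt. The class names, field names, and prompt layout are hypothetical, not the released data schema; the example values come from the Subtask 1 data example on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One dialogue turn with speaker metadata (field names are hypothetical)."""
    speaker_role: str       # e.g. "Wholesale Buyer"
    speaker_gender: str     # e.g. "Female"
    addressee_gender: str   # e.g. "Male"
    text: str               # English text of the turn


@dataclass
class TranslationRequest:
    """Everything the system may condition on for one target turn."""
    target_dialect: str
    domain: str
    context: List[Turn]             # previous dialogue turns, oldest first
    turn: Turn                      # the English turn to translate
    persona: Optional[str] = None   # persona information, when available


def describe(t: Turn) -> str:
    """Render the speaker metadata of a turn as a short tag."""
    return f"[{t.speaker_role}, {t.speaker_gender} -> {t.addressee_gender}]"


def build_prompt(req: TranslationRequest) -> str:
    """Flatten a request into a single text prompt (layout is an assumption)."""
    lines = [f"Target dialect: {req.target_dialect}", f"Domain: {req.domain}"]
    if req.persona:
        lines.append(f"Persona: {req.persona}")
    if req.context:
        lines.append("Previous turns:")
        lines.extend(f"  {describe(t)} {t.text}" for t in req.context)
    lines.append(f"Translate this turn {describe(req.turn)}:")
    lines.append(req.turn.text)
    return "\n".join(lines)


request = TranslationRequest(
    target_dialect="Moroccan Standard Darija Dialect",
    domain="Agriculture and farming",
    context=[Turn("Wholesale Buyer", "Female", "Male",
                  "For me, Hajj, this is more than just paper.")],
    turn=Turn("Farmer", "Male", "Female",
              "You speak the truth, my daughter."),
)
print(build_prompt(request))
```

Keeping the metadata in a typed record, separate from the prompt string, makes it easier to try other prompt layouts or to feed the same fields to a non-prompt-based system.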

Participants may enter either or both tracks:

  • Constrained track: use only the provided Alexandria dataset, with models up to 5B parameters.
  • Unconstrained track: external data, pretrained models, and larger models may all be used, with no parameter limit.

Data

  • Training and development data for model building.
  • Private test set with hidden gold translations.

Evaluation

  • spBLEU is the primary metric; chrF++ is also reported.
  • The overall score is the average across all countries.
  • Both tracks currently submit to the same CodaBench competition. Constrained and unconstrained submissions will be separated after participants submit their system descriptions.
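The averaging over countries can be sketched as below. This assumes an unweighted mean in which each country counts equally; the page does not say whether the official average is unweighted or pooled over turns. The two would differ slightly on the private test set, where Libya and Sudan have different sizes from the other countries. The function name and the scores are made up for illustration.

```python
from statistics import mean
from typing import Dict


def overall_average(per_country: Dict[str, float]) -> float:
    """Unweighted mean of per-country scores (assumption: every country counts equally)."""
    if not per_country:
        raise ValueError("no per-country scores given")
    return mean(list(per_country.values()))


# Made-up spBLEU values for three of the thirteen countries, for illustration only.
example_scores = {"EG": 30.0, "LY": 20.0, "MA": 25.0}
print(overall_average(example_scores))
```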

Data Split Sizes (in turns)

Split                         EG     JO     LB     LY     MA     MR     OM     PS     SA     SD     SY     TN     YE
Train                       3108   5501   8906      0   2573   5515   6280  14933   8470      0   6071   2034   3089
Dev                         1113   1113   1118      0   1110   1114   1109   1110   1110      0   1119   1116   1118
Public Test                 1118   1107   1106   1109   1115   1112   1118   1109   1113   1106   1114   1109   1106
Private Test (Test Phase)   1113   1109   1110   1309   1111   1119   1107   1111   1114    915   1114   1114   1113

Libya and Sudan do not include train or development sets, but they are still included in the test phase to evaluate how well participant systems generalize to these dialects.

Subtask 1 currently has two test splits: the public test set and the private test set. The public test set is already available on Hugging Face and can be used by participants to evaluate current systems. The private test set is reserved for the test phase of the subtask and will be used to evaluate final participant systems; evaluation on this split will be available only through CodaBench.

2

Cross-Dialect Arabic Machine Translation

Arabic-to-Arabic MT · Financial-domain queries

Objective

Participants need to build a system that translates text from one Arabic variety into another Arabic variety. The input is a short financial-domain query written in a source Arabic variety, plus the required target dialect or variety. The system must generate a translation that keeps the same meaning, uses appropriate financial terminology, and sounds fluent and natural in the target dialect.

The task will test both familiar dialect pairs and unseen dialect-pair settings to measure robustness across Arabic varieties.
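For local robustness analysis, a team might partition its own evaluation pairs into seen and unseen settings, as in the sketch below. It assumes that a pair is "seen" only when the same source-to-target direction occurs in training, so the reverse direction counts as unseen. The official definition is not stated here, and the variety labels are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (source variety, target variety)


def split_by_seen(
    train_pairs: Iterable[Pair], test_pairs: Iterable[Pair]
) -> Tuple[List[Pair], List[Pair]]:
    """Partition test pairs by whether the same direction occurs in training."""
    seen_in_train: Set[Pair] = set(train_pairs)
    seen: List[Pair] = []
    unseen: List[Pair] = []
    for pair in test_pairs:
        # Direction matters under this assumption: (A, B) does not cover (B, A).
        (seen if pair in seen_in_train else unseen).append(pair)
    return seen, unseen


# Hypothetical variety labels, for illustration only.
train = [("MSA", "Moroccan"), ("MSA", "Saudi")]
test = [("MSA", "Moroccan"), ("Saudi", "Tunisian"), ("Moroccan", "MSA")]
seen, unseen = split_by_seen(train, test)
print("seen:", seen)
print("unseen:", unseen)
```

Reporting scores separately for the two groups shows how much quality drops on dialect pairs the system was never trained on.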

Data

  • Short financial-domain queries in source Arabic varieties.
  • Target dialect or variety labels for each input.

Evaluation

  • Target-side translation quality and dialectal naturalness.
  • Seen and unseen dialect-pair settings for robustness analysis.
  • Official metrics and ranking details will be released with the evaluation scripts.

3

Dialectal Arabic MT Error Detection and Classification

Span prediction · Error classification · LQM-inspired typology

Objective

Participants need to build an evaluation system that analyzes machine translation outputs and identifies translation errors. The system must do two things: first, predict the exact word-level span in the translated text where an error occurs; second, assign an error category to each span using the provided LQM-inspired typology. Error classification categories are graphetics, morphosyntax, orthography_writing_conventions, pragmatics, semantics, and sociolinguistics.

Data

  • Training, development, and test data for MT error detection and classification.
  • Gold span and error-class labels are withheld for official evaluation.

Evaluation

  • Exact Match F1.
  • Overlap F1.
  • Error Class Macro-F1.
  • Overall Score: average of Exact Match F1 and Error Class Macro-F1.
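The four scores above could be computed along the following lines for local development. Only the Overall Score formula is taken from this page; every other definition is an assumption, and the official evaluation scripts are authoritative. The sketch assumes spans are inclusive word-index ranges within one segment, that Exact Match compares boundaries only, that Overlap is word-level, and that Macro-F1 averages over the classes present in the gold or predicted spans.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (first word index, last word index inclusive, error class)

CLASSES = [
    "graphetics", "morphosyntax", "orthography_writing_conventions",
    "pragmatics", "semantics", "sociolinguistics",
]


def f1(tp: int, n_pred: int, n_gold: int) -> float:
    """F1 from true positives and the number of predicted and gold items."""
    if tp == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def exact_match_f1(pred: List[Span], gold: List[Span]) -> float:
    """Assumption: a span is correct only if both boundaries equal a gold span's."""
    p = {(s, e) for s, e, _ in pred}
    g = {(s, e) for s, e, _ in gold}
    return f1(len(p & g), len(p), len(g))


def covered_words(spans: List[Span]) -> Set[int]:
    """All word indices covered by at least one span."""
    return {i for s, e, _ in spans for i in range(s, e + 1)}


def overlap_f1(pred: List[Span], gold: List[Span]) -> float:
    """Assumption: word-level F1, giving partial credit for partly correct spans."""
    p, g = covered_words(pred), covered_words(gold)
    return f1(len(p & g), len(p), len(g))


def class_macro_f1(pred: List[Span], gold: List[Span]) -> float:
    """Assumption: per-class F1 on exact (start, end, class) triples,
    averaged over the classes that occur in the gold or predicted spans."""
    active = [c for c in CLASSES if any(x[2] == c for x in pred + gold)]
    if not active:
        return 0.0
    scores = []
    for c in active:
        p = {x for x in pred if x[2] == c}
        g = {x for x in gold if x[2] == c}
        scores.append(f1(len(p & g), len(p), len(g)))
    return sum(scores) / len(scores)


def overall_score(pred: List[Span], gold: List[Span]) -> float:
    """Average of Exact Match F1 and Error Class Macro-F1, as stated for the task."""
    return (exact_match_f1(pred, gold) + class_macro_f1(pred, gold)) / 2


# Toy spans for one segment, for illustration only.
gold = [(0, 0, "sociolinguistics"), (1, 1, "semantics"), (3, 4, "morphosyntax")]
pred = [(0, 0, "sociolinguistics"), (1, 1, "morphosyntax"), (3, 3, "morphosyntax")]
print(exact_match_f1(pred, gold), overlap_f1(pred, gold),
      class_macro_f1(pred, gold), overall_score(pred, gold))
```

On the toy spans, two of three predicted boundaries match exactly (Exact Match F1 of 2/3), only the sociolinguistics span has both the right boundaries and the right class (Macro-F1 of 1/3 over three active classes), and the Overall Score is 0.5. Scoring a full test set would require pooling counts across segments, for example by adding a segment identifier to each span.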

Data Distribution

Split  Overall  ENG_EGY  ENG_MAU  ENG_MOR  ENG_PAL  ENG_UAE
Train     1125      263      244      169      218      231
Dev        138       32       30       21       27       28
Test       145       34       31       22       28       30

Data Examples

Data Examples from AlexandriaX-2026 subtasks.

Subtask 1

Context-Aware Dialogue Translation

Domain: Agriculture and farming · Dialect: Moroccan Standard Darija Dialect

English Conversation

Female -> Male · Wholesale Buyer · Turn 1

For me, Hajj, this is more than just paper. I want to build a long-term partnership with you, one that is based on trust.

Male -> Female · Farmer · Turn 2

You speak the truth, my daughter. A contract is just ink, but trust is what matters. May God bless our work together.

Dialectal Translation

Female -> Male · Wholesale Buyer · Turn 1

بنسبة ليا، لحاج، هادشي كتر من غير ورقة. بغيت نبني معاك شراكة long-term، تكون مبنية على ثقة.

Male -> Female · Farmer · Turn 2

عندك لحق ا بنتي. ل contrat غير مداد، ولكن ثقة هي لي مهمة. الله يبارك فخدمتنا.

Subtask 2

Cross-Dialect Arabic Machine Translation

Domain: Financial services · Intent: card arrival · Source variety: Modern Standard Arabic

MSA Source

ماذا افعل إن لم أستلم بطاقتي الجديدة؟

Moroccan Translation

شنو ندير إلا ماوصلتنيش لاكارط جديدة ديالي؟

Saudi Translation

وش أسوي إذا ما استلمت بطاقتي الجديدة؟

Tunisian Translation

آش نعمل كان ما جاتنيش كارطتي الجديدة؟

Subtask 3

MT Error Detection and Classification

Direction: English to Moroccan dialect · Output: highlighted span | error type

Source

Okay, I'll close my mouth and sew it up with needle and thread. I will not speak at all.

MT Prediction with Tagged Errors

طيب | Sociolinguistics، هادف | Semantics فمي و خيطو | Morphosyntax بِالْإِزْرِ | Semantics وَ الخيط. ما غادي | Morphosyntax نكلمش | Morphosyntax بزاف.

Important Dates

AlexandriaX-2026 shared-task timeline.

Task Launch

Release task website, documentation, and registration form.

Training and Development Release

Release task training/development data, baseline code, and evaluation scripts.

Registration Deadline and Blind Test Release

Teams register and receive blind test inputs for final evaluation.

Final System Output Deadline

Submission deadline for final predictions.

Final Results Released

Official rankings and diagnostic breakdowns shared with participants.

System Description Papers Due

Camera-ready system descriptions are due.

Shared Task Overview Paper Due

Overview paper describing the task, data, evaluation, and official results is due.

Conference Camera-ready Deadline

Final camera-ready materials due for the conference proceedings.

ArabicNLP / EMNLP Presentation Period

Shared task overview and participant systems presented during the conference period.

Participation

Multiple entry points for teams with different resources.

Teams may participate in any subset of the three subtasks.

Submission Requirements

  • Submit system outputs for the blind test set in the specified format.
  • Use only permitted data and model sizes for the constrained MT track.
  • Include enough system details for result verification.
  • Prepare a system description paper following ArabicNLP instructions.

Contact

Questions and announcements

Organizer Contact

For task questions, registration issues, or release coordination, contact the organizing team.

alexandriax2026@gmail.com

Organizing Committee

AlexandriaX-2026 organizers

The University of British Columbia · King Fahd University of Petroleum and Minerals · VinUniversity · Hamad Bin Khalifa University · King Abdullah University of Science and Technology