benchmark-tasks.rst (48 changes: 44 additions & 4 deletions)
@@ -95,7 +95,7 @@ Image classification is one of the most important problems in computer vision an
In addition, in each round the workers access disjoint sets of datapoints.


Implementation details:
**Implementation details:**

#. **Data Preprocessing**
We followed the same approach described in :cite:`DBLP:journals/corr/HeZRS15`.
@@ -180,7 +180,7 @@ Task 3: Language Modelling
"""""""""""""""""""""""

#. **Model**
We benchmark the `AWD-LSTM <https://github.com/salesforce/awd-lstm-lm>`_ model.
We benchmark the ASGD Weight-Dropped LSTM (`AWD-LSTM <https://github.com/salesforce/awd-lstm-lm>`_) model.

#. **Dataset**
The `Wikitext2 <https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip>`_ dataset is used.
@@ -225,6 +225,46 @@ Task 3: Language Modelling
The bandwidth between two nodes is around 7.5 Gbit/s. ``MPI``, ``GLOO`` or ``NCCL`` can be used for communication.


.. _benchmark-task-3b:

3b. BERT, Wikidump-20200101
"""""""""""""""""""""""""""

#. **Model**
TODO

#. **Dataset**
TODO

#. **Training Algorithm**
TODO

**Implementation details:**

#. **Data Preprocessing**
Before training, the data needs to be downloaded and pre-processed using the pre-processing script
``mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/download_dataset.sh <data_dir>``.
The raw dataset is available in our storage bucket `here <https://storage.googleapis.com/mlbench-datasets/wikidump/enwiki-20200101-pages-articles-multistream.xml.bz2>`_,
and the pre-processed data is available at ``https://storage.googleapis.com/mlbench-datasets/wikidump/processed/part-00XXX-of-00500``,
where ``XXX`` goes from ``000`` to ``499`` (a sketch for downloading these shards is given at the end of this section).

After pre-processing, the training data needs to be created using the script
``mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/create_pretraining_data.py``.
Please run it using the following command, once for each of the 500 files (a loop over all shards is sketched after the command):

.. code-block:: bash

    $ cd mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/

    $ python3 create_pretraining_data.py \
        --input_file=<path to ./results of previous step>/part-00XXX-of-00500 \
        --output_file=<tfrecord dir>/part-00XXX-of-00500 \
        --vocab_file=vocab.txt \
        --do_lower_case=True \
        --max_seq_length=512 \
        --max_predictions_per_seq=76 \
        --masked_lm_prob=0.15 \
        --random_seed=12345 \
        --dupe_factor=10
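
Since this command has to be run for every shard, a small shell loop can do it in one go. The following is a minimal sketch, not the official pipeline: it assumes the shards follow the ``part-00XXX-of-00500`` naming above, and ``RESULTS_DIR`` / ``TFRECORD_DIR`` are placeholder paths to adapt to your setup.

.. code-block:: bash

    # Placeholder directories; adjust to your setup.
    RESULTS_DIR=/data/wikidump/results     # output of the pre-processing step
    TFRECORD_DIR=/data/wikidump/tfrecords  # where the training records are written
    mkdir -p "${TFRECORD_DIR}"

    cd mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/

    # Run create_pretraining_data.py once per shard (000 ... 499).
    for i in $(seq -f "%03g" 0 499); do
        python3 create_pretraining_data.py \
            --input_file="${RESULTS_DIR}/part-00${i}-of-00500" \
            --output_file="${TFRECORD_DIR}/part-00${i}-of-00500" \
            --vocab_file=vocab.txt \
            --do_lower_case=True \
            --max_seq_length=512 \
            --max_predictions_per_seq=76 \
            --masked_lm_prob=0.15 \
            --random_seed=12345 \
            --dupe_factor=10
    done

Each shard is processed independently, so the loop body can also be parallelised (for example with ``xargs -P`` or GNU ``parallel``) if enough memory is available.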

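If you would rather not pre-process the raw dump yourself, the pre-processed shards mentioned above can be fetched directly. The following is a minimal sketch using ``curl``, assuming the ``part-00XXX-of-00500`` URL pattern given earlier; the target directory is a placeholder.

.. code-block:: bash

    # Placeholder target directory for the pre-processed shards.
    RESULTS_DIR=/data/wikidump/results
    mkdir -p "${RESULTS_DIR}"

    BASE_URL=https://storage.googleapis.com/mlbench-datasets/wikidump/processed

    # Fetch all 500 pre-processed shards (000 ... 499).
    for i in $(seq -f "%03g" 0 499); do
        curl -fSL -o "${RESULTS_DIR}/part-00${i}-of-00500" \
            "${BASE_URL}/part-00${i}-of-00500"
    done
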
Task 4: Machine Translation
^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -278,7 +318,7 @@ Task 4: Machine Translation
+ ``scale_window = 128`` (number of steps without overflow/underflow after which the loss scale is increased)


Implementation details:
**Implementation details:**

#. **Data Preprocessing**
The data needs to be downloaded, pre-processed and tokenized using the pre-processing script
@@ -371,7 +411,7 @@ Implementation details:
+ ``scale_window = 2000`` (number of steps without overflow/underflow after which the loss scale is increased)


Implementation details:
**Implementation details:**

#. **Data Preprocessing**
The data needs to be downloaded, pre-processed and tokenized using the pre-processing script