Summary of the PaLM model from Google
PaLM: Scaling Language Modeling with Pathways
[2204.02311] PaLM: Scaling Language Modeling with Pathways (arxiv.org)
The paper “PaLM: Scaling Language Modeling with Pathways” introduces the Pathways Language Model (PaLM), a 540-billion-parameter, dense, decoder-only Transformer. The model was trained with Pathways, a system developed by Google Research to orchestrate distributed computation across accelerators, making it possible to efficiently train a single model across multiple TPU v4 Pods. This represents a significant increase in scale over most previous large language models (LLMs), which were trained on smaller configurations.
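To make that scale concrete, the sketch below does a back-of-envelope parameter count for a dense decoder-only Transformer using the model shape reported for PaLM 540B (118 layers, model dimension 18,432, 48 attention heads, and a roughly 256k-token vocabulary). The ~12·L·d² rule of thumb ignores PaLM-specific choices such as multi-query attention and SwiGLU feed-forward layers, so it only lands in the right order of magnitude, not on the exact figure.

```python
# Back-of-envelope parameter count for a dense decoder-only Transformer.
# Shape values are those reported for PaLM 540B; the ~12*L*d^2 approximation
# ignores PaLM-specific details (multi-query attention, SwiGLU, parallel
# layers), so this is only an order-of-magnitude estimate.

def rough_param_count(num_layers: int, d_model: int, vocab_size: int) -> float:
    attention_and_mlp = 12 * num_layers * d_model**2  # ~4*d^2 attention + ~8*d^2 MLP per layer
    embeddings = vocab_size * d_model
    return attention_and_mlp + embeddings

palm_540b = rough_param_count(num_layers=118, d_model=18_432, vocab_size=256_000)
print(f"{palm_540b / 1e9:.0f}B parameters (approx.)")  # ~486B, same order as the reported 540B
```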
PaLM demonstrates breakthrough capabilities on numerous difficult tasks spanning language understanding and generation, reasoning, and code. It was evaluated on 29 widely used English natural language processing (NLP) tasks and surpassed the few-shot performance of prior large models on 28 of them. The evaluation covered question answering, cloze and sentence completion, Winograd-style tasks, in-context reading comprehension, commonsense reasoning, SuperGLUE, and natural language inference.
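Few-shot evaluation here means the model sees a handful of worked examples in its prompt and receives no gradient updates. Below is a minimal sketch of how such a prompt might be assembled; the exemplars and formatting are illustrative, not the exact templates used in the paper.

```python
# Minimal sketch of few-shot prompt construction: a few exemplars precede the
# test question, and the model is asked to continue the pattern. The exemplars
# and formatting here are illustrative, not the paper's templates.

def build_few_shot_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    lines = []
    for q, a in exemplars:
        lines.append(f"Q: {q}\nA: {a}\n")
    lines.append(f"Q: {question}\nA:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    exemplars=[
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ],
    question="What is the capital of Italy?",
)
print(prompt)
```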
In addition to English NLP tasks, PaLM showed strong performance on multilingual NLP benchmarks, including translation, even though only 22% of the training corpus is non-English. It also achieved breakthrough performance on the Beyond the Imitation Game Benchmark (BIG-bench), a suite of more than 150 new language modeling tasks. Interestingly, PaLM's performance as a function of scale follows a log-linear trend similar to that of prior models, suggesting that performance improvements from scale have not yet plateaued.
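The log-linear observation means that, across model sizes, benchmark scores grow roughly linearly in the logarithm of the parameter count. The sketch below fits such a trend; the three model sizes are the PaLM variants from the paper, but the benchmark scores would have to come from your own measurements, so none are hard-coded here.

```python
# Sketch of fitting a log-linear scaling trend: score ~ a + b * log10(params).
# The three model sizes are the PaLM variants (8B, 62B, 540B); the scores to
# fit are whatever benchmark numbers you have measured.
import numpy as np

def fit_log_linear(param_counts: list[float], scores: list[float]) -> tuple[float, float]:
    """Return (a, b) for score ~ a + b * log10(params)."""
    b, a = np.polyfit(np.log10(param_counts), scores, deg=1)
    return a, b

palm_sizes = [8e9, 62e9, 540e9]  # the three PaLM variants from the paper
# a, b = fit_log_linear(palm_sizes, measured_scores)
```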
By combining model scale with chain-of-thought prompting, PaLM showed breakthrough capabilities on reasoning tasks that require multi-step arithmetic or commonsense reasoning. For example, it solved 58% of the problems in GSM8K, a benchmark of challenging grade-school math questions, outperforming the prior top score, which was achieved by fine-tuning GPT-3 on a training set of problems and combining it with an external calculator and verifier. This new score approaches the average performance of 9- to 12-year-olds, the target audience for the question set. It was suspected that the separate encoding of each digit in the PaLM vocabulary contributed to these improvements.
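Chain-of-thought prompting differs from standard few-shot prompting only in that each exemplar's answer spells out intermediate reasoning steps before the final answer, which the model then imitates. The snippet below sketches such a prompt with a single worked exemplar in that style; it is an illustration, not the exact prompt used for the reported results.

```python
# Sketch of a chain-of-thought prompt: the exemplar answer includes the
# intermediate reasoning steps, not just the final number, so the model is
# nudged to reason step by step before answering the new question.
# This is illustrative, not the exact prompt used in the paper.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

print(build_cot_prompt(
    "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?"
))
```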
PaLM was also able to generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. For instance, it could provide high-quality explanations for novel jokes not found on the web.
The training of PaLM represented the first large-scale use of the Pathways system, scaling training to 6144 TPU v4 chips, the largest TPU-based configuration used for training to date. The training data combined English and multilingual datasets drawn from high-quality web documents, books, Wikipedia, conversations, and GitHub code. Notably, a “lossless” vocabulary was created that preserves all whitespace, splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.
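To illustrate what those three vocabulary properties mean in practice, the toy tokenizer below keeps whitespace as tokens, emits one token per digit, and falls back to UTF-8 bytes for anything outside a small, hypothetical known vocabulary. It is only a sketch of the described behavior, not PaLM's actual SentencePiece vocabulary.

```python
# Toy illustration of the "lossless" vocabulary properties described above.
# This is NOT PaLM's actual SentencePiece tokenizer; it only mimics the three
# stated behaviors: whitespace is preserved, numbers are split into one token
# per digit, and out-of-vocabulary text falls back to UTF-8 bytes.

KNOWN_TOKENS = {"the", "cat", "sat", "on", "mat"}  # hypothetical vocabulary

def toy_tokenize(text: str) -> list[str]:
    tokens: list[str] = []
    word = ""

    def flush():
        nonlocal word
        if not word:
            return
        if word in KNOWN_TOKENS:
            tokens.append(word)
        else:
            # Out-of-vocabulary: fall back to UTF-8 bytes so nothing is lost.
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
        word = ""

    for ch in text:
        if ch.isdigit():
            flush()
            tokens.append(ch)   # one token per digit
        elif ch.isspace():
            flush()
            tokens.append(ch)   # whitespace is kept as a token, not dropped
        else:
            word += ch
    flush()
    return tokens

print(toy_tokenize("the cat sat on 42 mats"))
# ['the', ' ', 'cat', ' ', 'sat', ' ', 'on', ' ', '4', '2', ' ', <byte tokens for 'mats'>]
```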