Bilingual LLMs are becoming increasingly important in our interconnected world, where language diversity is a common challenge. They have the potential to break down language barriers, promote cross-cultural understanding, and improve access to information and services for people who speak different languages, for instance by providing high-quality machine translation that facilitates communication across cultures and regions.
As demand for these models grows, so does the trend toward commercialization, and with it the need for greater transparency. Many organizations release only the model checkpoints and withhold vital details about how the models were built. To help restore transparency in AI, researchers at Kunlun Technology built Skywork-13B, a family of large language models trained on over 3.2 trillion tokens drawn from English and Chinese text and released with comprehensive disclosure.
The Skywork-13B family includes Skywork-13B-Base and Skywork-13B-Chat. The base model is a strong foundation model with state-of-the-art Chinese language modeling capability, and the chat model is a fine-tuned version optimized for conversation. Unlike many other organizations, the team discloses detailed information on the training process and data composition.
They also released intermediate checkpoints, which provide a valuable resource for understanding how the model's capabilities develop throughout training, and they believe this disclosure enables other researchers to leverage the checkpoints for their own use cases. In addition, they developed a novel method for detecting the level of in-domain data usage during the training stage.
The team trained the Skywork-13B foundation model on SkyPile. Rather than training on SkyPile as a whole, they followed a two-stage approach. The first stage constitutes the primary pretraining phase, in which the model is trained from scratch on SkyPile-Main. In the second stage, the model is enhanced with STEM-related domain knowledge and problem-solving skills through continual pretraining on SkyPile-STEM, as sketched below.
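To make the two-stage schedule concrete, here is a minimal sketch of general pretraining followed by STEM-focused continual pretraining using the Hugging Face Trainer. The corpus paths, tokenizer, toy model size, and hyperparameters are illustrative assumptions, not the Skywork team's actual configuration.

```python
# Sketch of a two-stage pretraining schedule (illustrative only; the dataset
# paths, tokenizer, model size, and hyperparameters are placeholder assumptions,
# not the Skywork team's exact setup).
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          LlamaConfig, LlamaForCausalLM, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Toy-sized LLaMA-style model initialized from scratch (the real model is 13B).
model = LlamaForCausalLM(LlamaConfig(vocab_size=len(tokenizer),
                                     hidden_size=512, intermediate_size=1376,
                                     num_hidden_layers=8, num_attention_heads=8))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def pretrain(data_files, output_dir, lr):
    """Run one pretraining stage over a (placeholder) JSONL corpus with a 'text' field."""
    ds = load_dataset("json", data_files=data_files)["train"]
    ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, learning_rate=lr,
                               per_device_train_batch_size=1, num_train_epochs=1),
        train_dataset=ds,
        data_collator=collator,
    ).train()

# Stage 1: primary pretraining from scratch on the general-purpose corpus.
pretrain("skypile_main/*.jsonl", "stage1-main", lr=6e-5)
# Stage 2: continual pretraining on STEM-heavy data, starting from the stage-1 weights.
pretrain("skypile_stem/*.jsonl", "stage2-stem", lr=2e-5)
```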
During training, the team monitored the language modeling loss on numerous reserved validation sets, each reflecting a distinct data distribution, with separate sets for code, academic publications, social media posts, and web texts in Chinese and English. They note that this approach is easy to construct, simple to compute, highly sensitive to training progress, and model-agnostic.
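As an illustration of this kind of monitoring, the following sketch scores a causal LM checkpoint on several reserved validation files and reports the per-domain loss along with perplexity (the exponential of the mean loss). The file names and checkpoint identifier are placeholder assumptions, and the loop works with any Hugging Face causal language model, reflecting the model-agnostic nature of the approach.

```python
# Sketch of per-domain validation monitoring: compute language-modeling loss
# and perplexity = exp(loss) on several reserved validation sets.
# File paths and the checkpoint name below are placeholder assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Skywork/Skywork-13B-base"   # placeholder; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto")
model.eval()

# One reserved validation file per data distribution (hypothetical paths).
validation_sets = {
    "code":   "val_code.jsonl",
    "papers": "val_papers.jsonl",
    "social": "val_social.jsonl",
    "web_zh": "val_web_zh.jsonl",
    "web_en": "val_web_en.jsonl",
}

@torch.no_grad()
def domain_loss(path, max_docs=200):
    """Average token-level cross-entropy of the model on one validation file."""
    docs = load_dataset("json", data_files=path)["train"]
    losses = []
    for doc in docs.select(range(min(max_docs, len(docs)))):
        ids = tokenizer(doc["text"], return_tensors="pt", truncation=True,
                        max_length=2048).input_ids.to(model.device)
        out = model(ids, labels=ids)          # labels are shifted internally
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

for domain, path in validation_sets.items():
    loss = domain_loss(path)
    print(f"{domain:8s} loss={loss:.3f}  perplexity={math.exp(loss):.2f}")
```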
The Skywork-13B model shows the best overall performance, obtaining the lowest average perplexity of 9.42. It also performs best across individual domains, achieving the lowest perplexity in the tech, movie, government, and finance domains. It not only surpasses models of similar size but also outperforms significantly larger models such as InternLM-20B and Aquila2-34B.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive advances in technology, and he is passionate about understanding nature at that level with the help of tools like mathematical models, ML models, and AI.