+++
title = "Wu Dao 2.0: China’s Answer To GPT-3. Only Better"
date = 2021-07-20T00:44:06+08:00
tags = ["AI"]
type = "blog"
categories = ["AI"]
banner = "img/banners/banner-3.jpg"
+++
## Wu Dao 2.0: China’s Answer To GPT-3. Only Better
The Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI) has introduced Wu Dao 2.0, the largest language model to date, with 1.75 trillion parameters. It has surpassed OpenAI’s GPT-3 and Google’s Switch Transformer in size. HuggingFace DistilBERT and Google GShard are other popular language models. Wu Dao means ‘enlightenment’ in English.
“Wu Dao 2.0 aims to enable ‘machines’ to think like ‘humans’ and achieve cognitive abilities beyond the Turing test,” said Tang Jie, the lead researcher behind Wu Dao 2.0. The Turing test checks whether a machine’s conversational behaviour can be told apart from a human’s.
Smartphone maker Xiaomi, short-video giant Kuaishou, on-demand service provider Meituan, more than 100 scientists and multiple organisations have collaborated with BAAI on this project.
Wu Dao 2.0
Wu Dao 2.0 is a pre-trained AI model with 1.75 trillion parameters that can simulate conversational speech, write poems, understand pictures and even generate recipes. The next-generation Wu Dao model can also predict the 3D structures of proteins, similar to DeepMind’s AlphaFold, and power virtual idols. Recently, China’s first virtual student, Hua Zhibing, was built on Wu Dao 2.0.
Wu Dao 2.0 was trained with FastMoE, a Fast Mixture-of-Experts (MoE) training system similar to Google’s Mixture of Experts. Unlike Google’s MoE, FastMoE is an open-source system based on PyTorch (Facebook’s open-source framework) that runs on common accelerators. It provides a hierarchical interface for flexible model design and easy adaptation to various applications such as Transformer-XL and Megatron-LM. FastMoE’s source code is publicly available.
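To make the mixture-of-experts idea concrete, here is a minimal sketch of top-1 gated routing in plain Python. It does not use FastMoE’s API and is not how Wu Dao 2.0 is implemented; the function names (`gate_scores`, `route_top1`, `moe_forward`), the toy gate weights and the two “experts” are all hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_scores(token, gate_weights):
    """One dot product per expert: how well does this token match each expert?"""
    return [sum(w * x for w, x in zip(row, token)) for row in gate_weights]

def route_top1(token, gate_weights):
    """Return (expert_index, gate_probability) for the best-scoring expert."""
    probs = softmax(gate_scores(token, gate_weights))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def moe_forward(token, gate_weights, experts):
    """Run only the selected expert and scale its output by the gate probability."""
    index, prob = route_top1(token, gate_weights)
    return [prob * y for y in experts[index](token)]

# Hypothetical toy setup: two experts, two-dimensional tokens.
experts = [
    lambda t: [2.0 * x for x in t],   # expert 0 doubles the input
    lambda t: [-1.0 * x for x in t],  # expert 1 negates the input
]
gate_weights = [
    [1.0, 0.0],  # expert 0 prefers tokens with a large first feature
    [0.0, 1.0],  # expert 1 prefers tokens with a large second feature
]

for token in ([3.0, 0.0], [0.0, 3.0]):
    index, prob = route_top1(token, gate_weights)
    print(token, "-> expert", index, "gate", round(prob, 3))
```

Because only the selected expert runs for each token, adding experts increases the total parameter count while the compute per token stays roughly constant, which is the general reason sparse MoE designs can reach very large sizes.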
“[FastMoE] is simple to use, high-performance, flexible, and supports large-scale parallel training,” wrote BAAI in its official WeChat blog.
Result-wise, Wu Dao 2.0 has surpassed state-of-the-art (SOTA) levels on nine benchmark tasks, according to BAAI.
[Figure: benchmark tasks where Wu Dao 2.0 surpasses other SOTA models (Source: BAAI)]
Towards multimodal model
Currently, AI systems are moving towards GPT-like multimodal and multitasking models to achieve artificial general intelligence (AGI). Experts believe there will be a rise in multimodal models in the coming months. Meanwhile, some are rooting for embodied AI, rejecting traditional bodiless models, such as neural networks, altogether.
Unlike GPT-3, Wu Dao 2.0 covers both Chinese and English, with skills acquired by studying 4.9 terabytes of texts and images, including 1.2 terabytes each of Chinese and English text.
Google has also been working towards a multimodal model similar to Wu Dao. At Google I/O 2021, the search giant unveiled language models such as LaMDA (2.6 billion parameters) and MUM (Multitask Unified Model), which is trained across 75 different languages and is said to be 1,000 times more powerful than BERT. At the time, Google CEO Sundar Pichai said that LaMDA, currently trained only on text, will soon shift to a multimodal model that integrates text, image, audio and video.
The training data of Wu Dao 2.0 include:
- 1.2 terabytes of English text data in the Pile dataset
- 1.2 terabytes of Chinese text in Wu Dao Corpora
- 2.5 terabytes of Chinese graphic data
Blake Yan, an AI researcher from Beijing, told South China Morning Post that these advanced models, trained on massive datasets, are good at transfer learning, just like humans. “Large-scale ‘pre-trained models’ are one of today’s best shortcuts to AGI,” said Yan.
“No one knows which is the right step,” said OpenAI in its GPT-3 demo blog post. “Even if larger ‘pre-trained models’ are the logical trend today, we may be missing the forest for the trees, and we may end up reaching a less determined ceiling ahead. The only clear aspect is that if the world has to suffer from ‘environmental damage,’ ‘harmful biases,’ or ‘high economic costs,’ not even reaching AGI would be worth it.”
## Turing NLG, GPT-3 & Wu Dao 2.0: Meet The Who’s Who Of Language Models
Language modelling uses statistical and probabilistic techniques to estimate the probability of a given sequence of words in a sentence. To predict the next word, a language model analyses the preceding text. Language modelling is commonly used in applications such as machine translation and question answering. Many researchers and developers working on robust and efficient language models posit that larger models, with a higher number of parameters, produce better outcomes. In this article, we compare three massive language models to find out if the theory holds.
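As a toy illustration of estimating the probability of a word sequence from preceding text, the sketch below builds a bigram model from word counts. It is far simpler than the Transformer-based models compared in this article; the three-sentence corpus and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Estimate P(word | prev) from the counts (0.0 if prev was never seen)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

def sequence_probability(counts, sentence):
    """Multiply the conditional probability of each word given the one before it."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        probability *= next_word_probability(counts, prev, word)
    return probability

# Hypothetical three-sentence training corpus.
corpus = [
    "the model writes poems",
    "the model writes recipes",
    "the student writes poems",
]
counts = train_bigram(corpus)
print(next_word_probability(counts, "writes", "poems"))
print(sequence_probability(counts, "the model writes poems"))
```

A sentence containing a word pair the model has never seen gets probability zero here, which hints at why larger models trained on more data tend to generalise better than simple count-based ones.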
Turing NLG
Microsoft introduced Turing NLG in early 2020. At the time, it was the largest model ever published, with 17 billion parameters. A Transformer-based generative language model, Turing NLG (T-NLG) is part of Microsoft’s Turing project, announced in 2020.
T-NLG can generate words to complete open-ended textual tasks and unfinished sentences. Microsoft has claimed the model can generate direct answers to questions and summarise documents. The team behind T-NLG believes that the bigger the model, the better it performs with fewer training examples. It is also more efficient to train a large centralised multi-task model rather than a new model for every task individually.
T-NLG is trained on the same type of data as NVIDIA’s Megatron-LM and has a maximum learning rate of 1.5×10^-4. Microsoft trained the model on 256 NVIDIA GPUs using DeepSpeed, its library for training large models more efficiently with fewer GPUs.
GPT-3
In July last year, OpenAI released GPT-3, an autoregressive language model with 175 billion parameters, trained on public datasets totalling 500 billion tokens. It is at least ten times bigger than previous non-sparse language models. To put things into perspective, its predecessor GPT-2 had just 1.5 billion parameters.
GPT-3 is applied without any gradient updates or fine-tuning. It achieves strong performance on many NLP datasets and can perform tasks such as translation, question answering, reasoning, and 3-digit arithmetic.
OpenAI’s language model achieved promising results in the zero-shot and one-shot settings, and occasionally surpassed state-of-the-art models in the few-shot setting.
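The zero-shot, one-shot and few-shot settings differ only in how many worked examples are placed in the prompt; the model’s weights are never updated. The sketch below only assembles the prompt text and calls no real API; the helper `build_prompt`, the `Q:`/`A:` layout and the translation examples are hypothetical.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt: task description, k worked examples, then the new query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)

# Hypothetical demonstrations for an English-to-French translation task.
demos = [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")]
task = "Translate English to French."

zero_shot = build_prompt(task, [], "dog")        # no examples
one_shot = build_prompt(task, demos[:1], "dog")  # one example
few_shot = build_prompt(task, demos, "dog")      # a handful of examples

print(few_shot)
```

In every case the model is simply asked to continue the text after the final `A:`, so any adaptation to the task comes from the examples in the prompt rather than from fine-tuning.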
GPT-3 has a wide range of applications, including:
- The Guardian published an entire article written using GPT-3, titled “A robot wrote this entire article. Are you scared yet, human?” The footnote said the model was given specific instructions on word count, language choice, and a short prompt.
- Solicitors, a short film of approximately four minutes, was written by GPT-3.
- A bot powered by GPT-3 was found to be interacting with people in a Reddit thread.
The industry’s reaction towards GPT-3 has been mixed. The language model has courted controversy over inherent biases, tendency to go rogue when left to its own devices, and its overhyped capabilities.
Wu Dao 2.0
Wu Dao 2.0 is the latest offering from the Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI). It is the largest language model to date, with 1.75 trillion parameters, and has surpassed previous models such as GPT-3 and Google’s Switch Transformer in size. Unlike GPT-3, Wu Dao 2.0 covers both Chinese and English, with skills acquired by studying 4.9 terabytes of texts and images, including 1.2 terabytes each of Chinese and English text.
It can perform tasks such as simulating conversational speech, writing poetry, understanding pictures, and even generating recipes. It can also predict the 3D structures of proteins like DeepMind’s AlphaFold. China’s first virtual student Hua Zhibing was built on Wu Dao 2.0.
Wu Dao 2.0 was trained with FastMoE, a Fast Mixture-of-Experts (MoE) training system. FastMoE is a PyTorch-based open-source system akin to Google’s Mixture of Experts. It offers a hierarchical interface for flexible model design and easy adaptation to applications such as Transformer-XL and Megatron-LM.
Are bigger models better?
Language models keep growing in size. Bigger models are assumed to be better at generalising and to take us a step closer towards artificial general intelligence.
Former Google AI researcher Timnit Gebru detailed the risks associated with large language models in her controversial paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”. The paper argued that although these models were extraordinarily good and could produce meaningful results, they carry risks such as huge carbon footprints.
Echoing similar sentiments, Facebook’s Yann LeCun said, “It’s entertaining, and perhaps mildly useful as a creative help. But trying to build intelligent machines by scaling up language models is like building high-altitude airplanes to go to the moon. You might beat altitude records, but going to the moon will require a completely different approach.”
All three language models discussed here were introduced within a span of just one and a half years. Research communities around the world are gearing up to develop the next ‘biggest’ language model to achieve unparalleled efficiency at task execution and get closer to the AGI holy grail. However, the lingering question is whether this is the right way to achieve AGI, especially in the face of risks including bias, discrimination, and environmental costs.
## UN Spokesperson Dujarric Voices Concern Over the Situation in East Jerusalem
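At his daily press briefing, spokesperson Dujarric also stressed that it is important for political and community leaders to speak out against violence, provocation and hate speech.
Bezalel Smotrich, leader of Israel’s far-right Religious Zionist Party, entered the Sheikh Jarrah neighborhood on the morning of June 22 together with a group of zealous Jewish supporters and tried to enter the home of a Palestinian family facing the threat of eviction.
Smotrich, who was guarded by Israeli police, also visited the home of Jewish settlers in the neighborhood, and Palestinians have condemned him for the provocation.
On the evening of June 21, a fight also broke out between Palestinians and Jewish settlers in Sheikh Jarrah, and seven people were injured when Israeli police intervened against the Palestinians with stun grenades and tear gas.
Since April 13, when Ramadan (the month of fasting) began, attacks by Israeli police have caused serious tension at Al-Aqsa Mosque and its surroundings and in the Sheikh Jarrah neighborhood of occupied East Jerusalem.
(June 23, 2021)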
## DeepMind AGI paper adds urgency to ethical AI
It has been a great year for artificial intelligence. Companies are spending more on large AI projects, and new investment in AI startups is on pace for a record year. All this investment and spending is yielding results that are moving us all closer to the long-sought holy grail — artificial general intelligence (AGI). According to McKinsey, many academics and researchers maintain that there is at least a chance that human-level artificial intelligence could be achieved in the next decade. And one researcher states: “AGI is not some far-off fantasy. It will be upon us sooner than most people think.”
A further boost comes from AI research lab DeepMind, which recently submitted a compelling paper to the peer-reviewed Artificial Intelligence journal titled “Reward is Enough.” They posit that reinforcement learning, a form of machine learning based on behavior rewards, will one day lead to replicating human cognitive capabilities and achieving AGI. This breakthrough would allow for instantaneous calculation and perfect memory, leading to an artificial intelligence that would outperform humans at nearly every cognitive task.
We are not ready for artificial general intelligence
Despite assurances from stalwarts that AGI will benefit all of humanity, there are already real problems with today’s single-purpose narrow AI algorithms that call this assumption into question. According to a Harvard Business Review story, when AI applications from predictive policing to automated credit scoring go unchecked, they represent a serious threat to our society. A recently published survey by Pew Research of technology innovators, developers, business and policy leaders, researchers, and activists reveals skepticism that ethical AI principles will be widely implemented by 2030. This is due to a widespread belief that businesses will prioritize profits and governments will continue to surveil and control their populations. If it is so difficult to enable transparency, eliminate bias, and ensure the ethical use of today’s narrow AI, then the potential for unintended consequences from AGI appears astronomical.
And that concern is just for the actual functioning of the AI. The political and economic impacts of AI could result in a range of possible outcomes, from a post-scarcity utopia to a feudal dystopia. It is possible too, that both extremes could co-exist. For instance, if wealth generated by AI is distributed throughout society, this could contribute to the utopian vision. However, we have seen that AI concentrates power, with a relatively small number of companies controlling the technology. The concentration of power sets the stage for the feudal dystopia.
Perhaps less time than thought
The DeepMind paper describes how AGI could be achieved. Getting there is still some ways away, from 20 years to forever, depending on the estimate, although recent advances suggest the timeline will be at the shorter end of this spectrum and possibly even sooner. I argued last year that GPT-3 from OpenAI has moved AI into a twilight zone, an area between narrow and general AI. GPT-3 is capable of many different tasks with no additional training, able to produce compelling narratives, generate computer code, autocomplete images, translate between languages, and perform math calculations, among other feats, including some its creators did not plan. This apparent multifunctional capability does not sound much like the definition of narrow AI. Indeed, it is much more general in function.
Even so, today’s deep-learning algorithms, including GPT-3, are not able to adapt to changing circumstances, a fundamental distinction that separates today’s AI from AGI. One step towards adaptability is multimodal AI that combines the language processing of GPT-3 with other capabilities such as visual processing. For example, based upon GPT-3, OpenAI introduced DALL-E, which generates images based on the concepts it has learned. Using a simple text prompt, DALL-E can produce “a painting of a capybara sitting in a field at sunrise.” Though it may have never “seen” a picture of this before, it can combine what it has learned of paintings, capybaras, fields, and sunrises to produce dozens of images. Thus, it is multimodal and is more capable and general, though still not AGI.
Researchers from the Beijing Academy of Artificial Intelligence (BAAI) in China recently introduced Wu Dao 2.0, a multimodal-AI system with 1.75 trillion parameters. This is just over a year after the introduction of GPT-3 and is an order of magnitude larger. Like GPT-3, multimodal Wu Dao — which means “enlightenment” — can perform natural language processing, text generation, image recognition, and image generation tasks. But it can do so faster, arguably better, and can even sing.
Conventional wisdom holds that achieving AGI is not necessarily a matter of increasing computing power and the number of parameters of a deep learning system. However, there is a view that complexity gives rise to intelligence. Last year, Geoffrey Hinton, the University of Toronto professor who is a pioneer of deep learning and a Turing Award winner, noted: “There are one trillion synapses in a cubic centimeter of the brain. If there is such a thing as general AI, [the system] would probably require one trillion synapses.” Synapses are the biological equivalent of deep learning model parameters.
Wu Dao 2.0 has apparently achieved this number. BAAI Chairman Dr. Zhang Hongjiang said upon the 2.0 release: “The way to artificial general intelligence is big models and [a] big computer.” Just weeks after the Wu Dao 2.0 release, Google Brain announced a deep-learning computer vision model containing two billion parameters. While it is not a given that the trend of recent gains in these areas will continue apace, there are models that suggest computers could have as much power as the human brain by 2025.
Source: Mother Jones
Expanding computing power and maturing models pave road to AGI
Reinforcement learning algorithms attempt to emulate humans by learning how to best reach a goal through seeking out rewards. With AI models such as Wu Dao 2.0 and computing power both growing exponentially, might reinforcement learning — machine learning through trial and error — be the technology that leads to AGI as DeepMind believes?
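To show what learning “through trial and error” means at its simplest, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit, one of the most basic reinforcement learning settings. It is not the method from the DeepMind paper; the hidden reward probabilities, the function name `run_bandit` and its default parameters are hypothetical.

```python
import random

def run_bandit(reward_probabilities, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    arms = len(reward_probabilities)
    pulls = [0] * arms          # how many times each arm was tried
    estimates = [0.0] * arms    # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(arms)  # explore: pick a random arm
        else:
            arm = max(range(arms), key=estimates.__getitem__)  # exploit
        reward = 1.0 if rng.random() < reward_probabilities[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean: nudge the estimate towards the observed reward.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls

# Hypothetical environment: three actions with hidden success rates.
estimates, pulls = run_bandit([0.2, 0.5, 0.8])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("most-pulled arm:", pulls.index(max(pulls)))
```

The agent is never told which action is best; it discovers that from the rewards it receives, which is the core loop that larger reinforcement learning systems build on.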
The technique is already widely used and gaining further adoption. For example, self-driving car companies like Wayve and Waymo are using reinforcement learning to develop the control systems for their cars. The military is actively using reinforcement learning to develop collaborative multi-agent systems such as teams of robots that could work side by side with future soldiers. McKinsey recently helped Emirates Team New Zealand prepare for the 2021 America’s Cup by building a reinforcement learning system that could test any type of boat design in digitally simulated, real-world sailing conditions. This allowed the team to achieve a performance advantage that helped it secure its fourth Cup victory.
Google recently used reinforcement learning on a dataset of 10,000 computer chip designs to develop its next-generation TPU, a chip specifically designed to accelerate AI application performance. Work that had taken a team of human design engineers many months can now be done by AI in under six hours. Thus, Google is using AI to design chips that can be used to create even more sophisticated AI systems, further speeding up the already exponential performance gains through a virtuous cycle of innovation.
While these examples are compelling, they are still narrow AI use cases. Where is the AGI? The DeepMind paper states: “Reward is enough to drive behavior that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalization and imitation.” This means that AGI will naturally arise from reinforcement learning as the sophistication of the models matures and computing power expands.
Not everyone buys into the DeepMind view, and some are already dismissing the paper as a PR stunt meant to keep the lab in the news more than advance the science. Even so, if DeepMind is right, then it is all the more important to instill ethical and responsible AI practices and norms throughout industry and government. With the rapid rate of AI acceleration and advancement, we clearly cannot afford to take the risk that DeepMind is wrong.
Gary Grossman is the Senior VP of Technology Practice at Edelman and Global Lead of the Edelman AI Center of Excellence.
## Chinese research team unveils new AI “Wu Dao 2.0” with 1.75 trillion parameters, surpassing models from Google and OpenAI
June 4, 2021, 16:00 | Science
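On June 1, 2021, a research team led by the Beijing Academy of Artificial Intelligence (BAAI), which is funded by the Chinese government, announced a new pre-trained model called Wu Dao 2.0 (WuDao 2.0). Wu Dao 2.0 uses 1.75 trillion parameters, which is said to exceed the pre-trained models developed by OpenAI and by Google Brain, a Google subsidiary.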
US-China tech war: Beijing-funded AI researchers surpass Google and OpenAI with new language processing model | South China Morning Post
China's gigantic multi-modal AI is no one-trick pony | Engadget
https://www.engadget.com/chinas-gigantic-multi-modal-ai-is-no-one-trick-pony-211414388.html
China Says WuDao 2.0 AI Is an Even Better Conversationalist than OpenAI, Google | Tom's Hardware
https://www.tomshardware.com/news/china-touts-wudao-2-ai-advancements
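Wu Dao 2.0 is a deep learning model developed by more than 100 researchers from multiple institutions, led by BAAI, a non-profit research organization. Its parameter count reaches 1.75 trillion, which BAAI’s researchers claim exceeds the 175 billion of GPT-3, the language model OpenAI announced in June 2020, and the maximum of 1.6 trillion of Switch Transformer, the language model developed by Google Brain.
Parameters are variables defined by a machine learning model. As the model evolves through training, its parameters are refined and it becomes able to produce more accurate results. For that reason, models with more parameters generally tend to be more sophisticated.
Wu Dao 2.0 was trained on a total of 4.9 TB of text and image data, including 1.2 TB each of Chinese and English text. Unlike deep generative models specialized for a particular task such as image generation or facial recognition, it can reportedly write essays and poems, generate captions from still images, and generate images from text descriptions.
Blake Yan, an AI researcher in Beijing, commented: “These sophisticated models, trained on huge datasets, need only a small amount of new data when used for a specific function, because, like humans, they can transfer knowledge they have already learned to new tasks.” According to the South China Morning Post, Wu Dao 2.0 has already partnered with 22 companies, including smartphone maker Xiaomi.
Zhang Hongjiang, the head of BAAI, said that “large-scale pre-trained models are one of the best shortcuts towards artificial general intelligence,” suggesting that Wu Dao 2.0 was built with artificial general intelligence in mind.
The Chinese government has invested heavily in BAAI, providing 340 million yuan (about 5.85 billion yen) in 2018 and 2019 alone. The US government also announced in 2020 that it would invest more than 100 billion yen in AI and quantum computing, and the technology race between the US and China is intensifying.
Commenting on the Wu Dao 2.0 announcement, technology outlet Tom’s Hardware pointed out that parameter count is not the only thing that matters for AI performance; the size and content of the dataset matter too. GPT-3, for example, was trained on just 570 GB of data, but that data had been filtered down through preprocessing from a 45 TB dataset. It therefore argued that “the raw numbers associated with Wu Dao 2.0 are impressive, but may not be indicative of the model’s performance.”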
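The reported parameter counts allow a quick size comparison. The short sketch below only restates the figures quoted in this article and says nothing about model quality; the dictionary and the helper `size_ratio` are hypothetical names used for illustration.

```python
# Parameter counts as reported in the text above.
PARAMETER_COUNTS = {
    "GPT-3": 175e9,
    "Switch Transformer": 1.6e12,
    "Wu Dao 2.0": 1.75e12,
}

def size_ratio(model, baseline):
    """How many times more parameters `model` has than `baseline`."""
    return PARAMETER_COUNTS[model] / PARAMETER_COUNTS[baseline]

for baseline in ("GPT-3", "Switch Transformer"):
    print(f"Wu Dao 2.0 vs {baseline}: {size_ratio('Wu Dao 2.0', baseline):.2f}x")
```

On these figures Wu Dao 2.0 is ten times the size of GPT-3 but only about 9% larger than the biggest Switch Transformer configuration.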