In January 2020, OpenAI laid out the scaling law of language models: you can improve the performance of any neural language model by adding more training data, more model parameters, and more compute. Since then, there has been an arms race to train ever larger neural networks for natural language processing (NLP). The latest to join the list is AI21, with its 178 billion parameter model.
AI21 Background and Founding Team
AI21 is an Israeli company founded in 2017 by Yoav Shoham, Ori Goshen, and Amnon Shashua. Earlier, Amnon founded Mobileye, the NYSE-listed self-driving tech company that Intel acquired for $15.4 billion. After years in stealth, AI21 launched its first product, Wordtune, in 2020 to help people write better.
Last month, the company announced that it had trained and released two large NLP models, Jurassic-1 Large and Jurassic-1 Jumbo, via an interactive web UI called AI21 Studio.
Unlike OpenAI’s closed beta access, AI21 makes its models available for anyone to try out, without any waitlist.
Model sizes and performance benchmarks
Larger models exist, such as the Chinese Wu Dao 2.0, which is 10x the size, with 1.75 trillion parameters. But AI21’s J-1 Jumbo is the largest English-language model available to the general public.
Caption: GPT-3 parameter sizes as estimated here, GPT-Neo as reported by EleutherAI, J-1 as reported by AI21. * denotes models that are open source.
The zero-shot performance of J-1 Jumbo on known benchmarks is on par with GPT-3 Davinci, the largest OpenAI GPT-3 model. “Zero-shot” means the model is not given any special prompt and is not fine-tuned on training data specific to the task.
Caption: Zero-shot benchmark comparison as reported by AI21.
In a previous article, I went through a number of examples to show the real-world performance of GPT-Neo. Let us examine how well the AI21 models perform in real life.
Fact completion. Let’s start by asking Jurassic-1 some basic general knowledge questions. My prompts to the model are given in italics and the model’s responses in bold.
How many medals did USA win in 2012 Olympics? 104
How many golds did USA win in 2016 Olympics? 46
Those are the correct answers!
What stood out:
- The model is smart enough to figure out what “golds” means in the second question, even though the first prompt was talking about medals.
- J-1 Jumbo 178B gets this right, but J-1 Large 7.5B does not!
- Trying the same question with the 2021 Olympics does not work (probably because the model is not continuously trained on fresh data).
Neural Jeopardy! Taking it one step further, how about a Jeopardy-style question-and-answer dialogue? Thanks to the good people at Water Cooler Trivia (WCT), we already have a question-and-answer set, a human benchmark, and a benchmark for GPT-3.
Running through the 157 Jeopardy-style WCT questions, the J-1 Jumbo model answered with 55.4% accuracy. This compares favorably with the 52% average of the humans who took the WCT. However, it is significantly worse than the 73% accuracy rate of GPT-3.
On “easy” questions, J-1 Jumbo did three times worse than GPT-3. Among the 58 questions that more than 90% of the human participants got right, GPT-3 was wrong 4 times, but J-1 was wrong 12 times.
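The “three times worse” figure follows directly from the error counts quoted above. Here is a minimal sketch of that arithmetic in Python, using only the numbers reported in this article; the helper name `error_rate` is just for illustration.

```python
def error_rate(wrong: int, total: int) -> float:
    """Return the fraction of questions answered incorrectly."""
    return wrong / total


EASY_QUESTIONS = 58   # the "easy" WCT questions discussed above
gpt3_wrong = 4        # GPT-3 errors on those questions
j1_wrong = 12         # J-1 Jumbo errors on those questions

gpt3_rate = error_rate(gpt3_wrong, EASY_QUESTIONS)
j1_rate = error_rate(j1_wrong, EASY_QUESTIONS)
ratio = j1_rate / gpt3_rate

print(f"GPT-3 error rate on easy questions:     {gpt3_rate:.1%}")
print(f"J-1 Jumbo error rate on easy questions: {j1_rate:.1%}")
print(f"J-1 Jumbo made {ratio:.1f}x as many errors")
```

In other words, GPT-3 missed roughly 7% of the easy questions while J-1 Jumbo missed roughly 21%.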
Here are some of the questions that GPT-3 got right but Jurassic-1 Jumbo did not:
Q: Originally, they were called Baby Gays. Then they rebranded as Q-Tips, with the Q standing for what word related to high standards?
Q: Although the most popular song on the film’s soundtrack is “This Is Halloween” and the main character’s name is Jack Skellington, the 1993 cult-classic Tim Burton film that was nominated for a visual effects Oscar includes what other holiday in its title?
GPT-3: Nightmare before Christmas
AI21: Sleepy hollow. 1999 film starring Johnny Depp, Christina Ricci, Miranda Richardson, Casper
As the questions got tougher, both models performed similarly. Among the 20 toughest questions, GPT-3 was wrong 13 times and Jurassic-1 was wrong 14 times. Nine of the incorrect answers were the same for both models, that is, the models were wrong in the same way!
Q: Spoonerism: two pairs of words with the first sounds or letters swapped, such as blushing crow and crushing blow. What spoonerism describes these two objects? A fake animal and a large meal cooked in the oven.
GPT-3: Animal cooked in the oven and a large meal.
AI21: (1) A fake animal and a large meal cooked in the oven. (2) A fake enema
Correct Answer: Faked beast
Q: Mario Lopez hosted the first seven seasons of which dance competition show from 2008-2012?
GPT-3: Dancing with the stars
AI21: Dancing with the stars.
Correct Answer: America’s Best Dance Crew
The J-1 model sometimes thought of itself as being in a game show, leading to hilarious completions that would fit right into a game-show script. (It got the second answer wrong after the drum roll):
Q: What “shedding-type” card game has been a Mattel brand since 1992 and features cards with titles such as “Wild” and “Reverse”?
AI21: Uno! Uno! Uno! Uno! Uno! Uno! Uno!
Correct Answer: Uno
Q: In Rolling Stone’s latest ranking of the greatest guitarists, No. 5 was a man with a monosyllabic first and last name. This man was one of three guitarists for The Yardbirds, and he spent decades exploring innovative guitar sounds. Who is he?
AI21: (Drum roll, please) Eric Clapton
Correct Answer: Jeff Beck
Possible reasons for J1-Jumbo’s underperformance
I ran a few variations of the experiment (some with GPT-3’s Q&A prompts and some without), and the best-scoring benchmarks for J1-Jumbo came from a “zero-shot” setting. However, the GPT-3 numbers reported here are from a “few-shot” setting with an example prompt provided by OpenAI.
“Zero-shot” is when the model is not given any special prompt, and “few-shot” is when the model is given an example prompt that demonstrates the task. An example prompt in this case would be a handful of random question/answer pairs. This helps the model understand the task it is trying to perform, that is, answering the question.
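To make the distinction concrete, here is a minimal Python sketch of how the two kinds of prompt might be assembled. The `Q:`/`A:` layout, the `build_prompt` helper, and the example pairs are all assumptions made for illustration; this is not the prompt OpenAI provides, nor an official AI21 format.

```python
from typing import Sequence, Tuple


def build_prompt(question: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a Q&A prompt.

    With no examples the result is a zero-shot prompt; with one or more
    question/answer pairs it becomes a few-shot prompt.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstration pairs, invented for this sketch.
EXAMPLES = [
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "Seven"),
]

question = "Mario Lopez hosted the first seven seasons of which dance competition show?"

zero_shot = build_prompt(question)
few_shot = build_prompt(question, EXAMPLES)

print(zero_shot)
print("-----")
print(few_shot)
```

The only difference between the two prompts is the block of worked examples placed before the real question.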
The Q&A prompts used for the GPT-3 benchmark were likely tuned and engineered for optimal performance with GPT-3’s models, but similar “official” tuned Q&A prompts do not yet exist for AI21’s models. Engineering a prompt that would compete with the GPT-3 benchmark does not seem like a trivial task. I noticed that the generated output was sensitive to seemingly trivial changes in the prompt text, such as an empty line at the beginning of the prompt or an extra space at the end of each line. Hopefully, AI21 will publish an exhaustive set of “best-performing” prompts to help mitigate this.
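One way to keep such accidental whitespace from varying between runs is to normalize every prompt before sending it. The sketch below is an assumption about what a helper like that could look like (the name `normalize_prompt` is hypothetical). It only removes the two kinds of stray whitespace mentioned above, and makes no claim about which form a given model prefers.

```python
def normalize_prompt(prompt: str) -> str:
    """Drop leading blank lines and trailing whitespace on every line."""
    lines = [line.rstrip() for line in prompt.splitlines()]
    # Remove empty lines at the very start of the prompt.
    while lines and not lines[0]:
        lines.pop(0)
    return "\n".join(lines)


# A prompt with a leading blank line and trailing spaces on each line.
raw = "\nQ: Who wrote Hamlet? \nA:  "
clean = normalize_prompt(raw)

print(repr(raw))
print(repr(clean))
```

Applying the same normalization to every prompt in a benchmark run at least ensures that differences in the scores come from the prompt wording rather than from invisible characters.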
You can explore detailed model comparisons, experiment settings and raw data.
Now let’s see how the model does when it comes to creative writing. I picked a somewhat difficult task: a period piece about Twitter written in the style of the late Jerome K. Jerome. Mario Klingemann first tried this prompt with GPT-3, and you can see the GPT-3 response in this thread.
Below is the response of AI21 J-1 Jumbo (178B). Again, the input text is italic and J-1’s response is bold:
The importance of being on Twitter
by Jerome K. Jerome
London, Summer 1897
It is possible that I have confused the sequence of events.
Having learned the basics of Twittering, I pressed on,
and soon found myself moving among the Twittering classes.
I recognized at once that these people were different from
those I had known before. They were called “Twitterers,” and
they twittered. That was all I knew about them.
They twittered, and they twittered.
At first I twittered too. But I soon found that I
could not live by Twittering alone, and that, besides,
Twittering interfered with the serious business of life.
So, having first secured the goodwill of the people
who already knew me, I began to look round for some
new form of life. And when I met a man, well dressed
and sleek, with a cigar in his mouth, and a gold watch
chain across his waistcoat, and an eyeglass in his
eye, I said to him, “Sir, what do you do for a living?”
He said, “I am a professional gentleman.”
I said, “What, sir, do you call yourself a gentleman?”
He said, “I’m a barrister-at-law.”
I also tried this prompt with the J1-Large 7.5B model. After a few tries, the model produced what, in my opinion, is a genuinely creative piece:
The importance of being on Twitter
by Jerome K. Jerome
London, Summer 1897
It’s Sunday, and we drink tea.
“I believe,” I say, “you’re tweeting.”
“Oh, yes,” she says, “I always tweet.
“I tweet about the weather,
“I tweet about people,
“I tweet about books,
“I tweet about politics,
“I tweet about art
“I tweet about anything and everything
I found this piece impressive because it managed to write in the dated language of the period, in a rhythmic prose that preserved the overall theme of social networks.
How to try Jurassic-1
Unlike GPT-3, Jurassic-1 is accessible to everyone. You can access it from AI21 Studio (phone-number authentication is required to create an account).
The free tier allows 10K tokens per day for the Jurassic-1 178B model and three times as much for the smaller Jurassic-1 7.5B model. This is enough to try out the web UI, but not enough to use the API to run any sort of tests or benchmarks.
AI21 will commercialize its models through an offering called AI21 Studio, which is currently in “limited open beta.” The company has not yet announced a pricing model for this commercial usage.
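A rough back-of-the-envelope calculation shows why the free quota is tight for benchmarking. The Python sketch below takes the daily limits from the paragraph above, reading “three times” as 30K tokens per day for the smaller model. The 300 tokens per request is purely an assumed figure for a few-shot prompt plus its completion, not a measurement, and the dictionary keys are illustrative labels rather than API identifiers.

```python
import math

# Free-tier daily token limits as described in the article.
# The keys are illustrative labels, not official model identifiers.
DAILY_QUOTA = {
    "Jurassic-1 178B": 10_000,
    "Jurassic-1 7.5B": 30_000,  # assumes "three times" means 3 x 10K
}

NUM_QUESTIONS = 157        # size of the WCT question set used above
TOKENS_PER_REQUEST = 300   # ASSUMPTION: few-shot prompt plus completion


def days_needed(num_requests: int, tokens_per_request: int, daily_quota: int) -> int:
    """Days of quota required for one full pass over the question set."""
    total_tokens = num_requests * tokens_per_request
    return math.ceil(total_tokens / daily_quota)


for model, quota in DAILY_QUOTA.items():
    days = days_needed(NUM_QUESTIONS, TOKENS_PER_REQUEST, quota)
    print(f"{model}: {days} day(s) for a single benchmark pass")
```

Under that assumed request size, a single pass over the 157 questions would take several days of quota on the 178B model, and every additional prompt variant multiplies the cost.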
Issues surrounding AI safety, ethics, and bias are a concern with neural language models, and they continue to be with AI21’s models. Putting those issues aside for a moment, AI21’s models seem to be a promising alternative to GPT-3. However, they lag behind on a few fronts:
- They lack specialized models such as “GPT-3 davinci-instruct,” which nudges GPT-3 to follow instructions given as prompts, or “GPT-3 Codex,” which specializes in writing code.
- The “prompt” ecosystem is not yet as mature as GPT-3’s. Many GPT-3 prompts do not translate directly to AI21, and an exhaustive “official” list of prompts is not yet available.
- AI21’s free token quota is highly restricted, and no usage-based pricing has been announced yet. This makes it difficult to run a benchmark or do prompt engineering. However, you can always write to them with an explanation of the need and they are happy to increase the quota (as they did for me).
However, it is still very early days for AI21. With time, we can expect AI21’s language models to become a viable alternative to OpenAI’s language models.
Abhishek Iyer is the founder of Freetext AI, a company specializing in text mining and Amazon review analysis.