The New Artificial Intelligence Hype
Part 2: How to Run Stable Diffusion On Your Laptop
In the last few years, the hype around artificial intelligence has been increasing (again). Most of it is due to companies like OpenAI, Google, DeepMind (a Google subsidiary), Meta, and others producing truly groundbreaking research and innovative showcases in the field. From machines winning complex games like Go and Dota 2 to a variety of content generation techniques that produce text, images, audio, and now video, these technologies will have an impact on our future.
It feels like we have experienced this hype towards AI in the past, but it never really materialized into anything relevant to our lives. From IBM’s Watson attempts to revolutionize healthcare to the prophecies of self-driving cars, we have been told about how AI will improve our society, yet there always seems to be something preventing us from getting there. On one hand, the technology might not be there yet for some of those advanced problems; on the other, humans tend to be skeptical of machines taking over some of our areas of expertise (Skynet didn’t help here).
However, this time it feels different. Firstly, use cases are far less ambitious than in the past and have concrete practical (and fun) applications. Secondly, research in the last 5-10 years has produced some of the biggest leaps ever in the machine learning and deep learning fields. Generative Adversarial Networks (GANs), Diffusion Models, and Transformer Models are good examples of such breakthroughs. Thirdly, this time around the required technology and processing power are here to enable us to run and train these massive networks.
It is estimated that OpenAI spent around $10M to $20M to train its GPT-3 text-to-text model. Cost should be higher with models dealing with images.
Where Are We and How Did We Get Here?
So, where are we right now? In the last 5 to 7 years, several specific innovations and practical applications of AI have brought the technology (and its respective implications) into public discussion. Before going into what is already possible, let’s go through the more relevant announcements of the last few years.
2015 - Google creates DeepDream - Read More
Google releases a new method using Convolutional Neural Networks that can dream up new images based on its training set. The network could generate new images of cats, for example, after learning from tons of real cat images.
2016 - Google builds AlphaGo that beats Go world champion - Read More
AlphaGo was trained using reinforcement learning techniques, with the network competing against itself millions of times and getting better at the game with each iteration. AlphaGo beat the Go world champion and even played moves that had never been seen before, showing that it had gone beyond learning moves from other games and into discovering its own unique plays.
2019 - OpenAI Five beats the Dota 2 champions - Read More
OpenAI Five was trained using techniques similar to AlphaGo’s: the network went through millions of games against itself and got better and better. The challenge with playing a multiplayer online 3D game like Dota 2 was the immense action space available to the player. OpenAI proved that by using its models and new training techniques, it was possible to solve these problems successfully.
2020 - OpenAI reveals GPT-3 - Read More
Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that uses deep learning to produce human-like text. The network was trained on more than 400B text tokens from a giant textual training set. The model can then keep writing text given an initial prompt. The impressive part is that more than being grammatically and syntactically correct, the story being told is coherent across sentences. Take a look at the video below if you want some examples of what it can do. For a more detailed explanation of what is happening, you can check this video in which a network comes up with a very believable story about a scientist that discovered unicorns in South America.
2021/22 - OpenAI announces Dall-E and Dall-E 2 - Read More and Here
Dall-E and Dall-E 2 are networks trained to generate images from textual prompts, with Dall-E 2 relying on diffusion models to do so. You can write a sentence and the AI will come up with an image for it in a short time frame. The model can output different types of styles, and previous images can be used to guide the creation of new ones.
2022 - Midjourney, led by Leap Motion co-founder David Holz, releases its model - Read More
Midjourney is also a text-to-image model. What someone can do with it is almost identical to Dall-E; however, there is a noticeable difference in the outputs it provides because of the different training sets. That does not necessarily mean that one is better than the other, just different.
2022 - Stable Diffusion released by a collaboration of Stability AI, CompVis LMU, and Runway with support from EleutherAI and LAION - Read More
Same as Dall-E and Midjourney, Stable Diffusion is another model to generate images from textual prompts. The main difference is that the entities that created this model made it open-source, which means that anyone can play around with it. This generated lots of buzz around it, as the previous models were proprietary at the time.
As of right now, it is possible to use most of these technologies either locally or through a service (e.g., OpenAI API) to generate text and images. It is possible to generate entire chapters for a book from small prompts of text. The result might not be ready to release, but at least it will help with writer’s block. It is also possible to generate images from text, images from images, and even inpaint and outpaint existing images. In other words, you can erase part of an image you have and let one of these models complete it using either another image or a text prompt, or you can extend an existing image beyond its borders using the same techniques (example below).
Is This Magic?
All of these recent advancements are mainly attributed to three big milestones in Deep Learning research: Generative Adversarial Networks (GANs), Diffusion Models, and Transformer Models.
GANs were a revolutionary framework for training generative networks without needing a complete, labeled set of data to do so. At a high level, the method has two different networks compete against each other in a game where only one can win, each learning and getting better with every iteration. Deepfakes, for example, are usually generated using this method: one network tries to generate a fake image of someone, and another one attempts to guess whether it is fake or real. AlphaGo and OpenAI Five relied on a related competitive idea, self-play reinforcement learning, rather than on GANs themselves.
The problem with these techniques is that training is hard, and once the first network knows how to fool the second one, it has little to no incentive to try interesting new things.
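To make the two-network game a bit more concrete, here is a minimal Python sketch of the scores each side is trying to improve. It is not a full GAN: there are no networks and no training loop, and the discriminator outputs are made-up numbers chosen for illustration. The function names and the use of the common "non-saturating" generator loss are assumptions of this sketch, not something taken from a specific implementation.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the fakes score near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs: probability that a sample is real.
real_scores = [0.9, 0.8, 0.95]
caught_fakes = [0.1, 0.2, 0.05]    # the discriminator is winning
passing_fakes = [0.9, 0.8, 0.95]   # the generator is winning

print("discriminator winning:",
      discriminator_loss(real_scores, caught_fakes), generator_loss(caught_fakes))
print("generator winning:    ",
      discriminator_loss(real_scores, passing_fakes), generator_loss(passing_fakes))
```

When the fakes are caught, the discriminator's loss is low and the generator's is high; when the fakes pass, the situation flips. Training alternates between nudging each network to lower its own loss, which is what makes the process competitive and also hard to balance.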
Enter Diffusion Models. These models were designed so that generating a valid image doesn’t happen in one step, but along a denoising process that can take N steps. A training set is built by adding different levels of noise to valid real images (and their respective textual descriptions). The learning process then consists of the network learning how to remove noise in small amounts to get to the final image. This increases the control over the learning process and ends up producing networks that can generate a much wider variety of outputs than before. If you want to learn more about how it all works, I recommend the video below.
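The "adding different levels of noise" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it follows the commonly used DDPM-style formulation (which the text above does not specify), the linear schedule values are typical defaults rather than anything from a particular model, and a list of four numbers stands in for an image.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Noise variance added at each step, growing linearly (assumed values)."""
    return [beta_start + (beta_end - beta_start) * i / (steps - 1) for i in range(steps)]

def cumulative_alphas(betas):
    """alpha_bar[t]: fraction of the original signal variance left after step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to noise level t; return (noisy pixels, the noise that was added)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal * p + sigma * n for p, n in zip(pixels, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
image = [0.2, -0.5, 0.9, 0.0]   # stand-in for normalized pixel values

slightly_noisy, _ = add_noise(image, 0, alpha_bars, rng)
almost_pure_noise, target_noise = add_noise(image, 999, alpha_bars, rng)
print(slightly_noisy)
print(almost_pure_noise)
```

Each (noisy image, added noise) pair is one training example: the network sees the noisy version and the step number, and is trained to predict the noise so it can be subtracted. Generation then runs the process in reverse, starting from pure noise and removing a little at each step.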
Finally, we have Transformer Models. This was one of the most important advancements in the machine learning field, and arguably one of the cornerstones that makes everything we are seeing today possible. These models are neural networks that can learn context, and therefore infer meaning from sequential data.
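The mechanism that lets these models "learn context" is usually called attention: every element of a sequence looks at every other element and decides how much each one matters. Below is a deliberately simplified Python sketch. Real transformers use learned query, key, and value projections and many attention heads; here the raw input vectors play all three roles, and the 2-D "embeddings" are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each output is a weighted mix of all inputs; the weights come from
    the dot-product similarity between the current vector and every other one.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-D embeddings for a three-token sequence.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

In the printed weights, the first token pays more attention to the second token (which points in a similar direction) than to the third. Stacking many such layers, with learned projections, is what allows a transformer to relate words that are far apart in a sentence.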
Before transformers, networks relied on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which typically learned from large labeled datasets. These datasets took a lot of time and money to produce and increased the complexity of the final model. Transformers are usually trained without manually labeled datasets, because the training signal comes from patterns in the data itself, for example by predicting the next word in a sentence. This means that it is now possible to train the new models with the trillions of images and petabytes of text data available on the internet and in company databases.
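One way to see why no hand-made labels are needed is to look at how training examples can be cut out of plain text: the "label" for each example is simply the word that comes next. The Python sketch below is a toy version of that idea. Splitting on whitespace is a stand-in for real tokenization, and the function name and `context_size` parameter are made up for illustration.

```python
def next_token_examples(text, context_size=3):
    """Build (context, next word) training pairs from raw, unlabeled text."""
    words = text.split()  # whitespace split as a stand-in for a real tokenizer
    examples = []
    for i in range(context_size, len(words)):
        context = words[i - context_size:i]
        target = words[i]
        examples.append((context, target))
    return examples

examples = next_token_examples("the cat sat on the mat")
for context, target in examples:
    print(context, "->", target)
```

Every sentence on the internet yields examples like these for free, which is why the amount of usable training data grows so dramatically compared with datasets that someone has to label by hand.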
AI Democratization
One of the main differences between this AI hype wave and past ones is that the number of people who can try it and interact with it is far bigger than it ever was. The internet made it possible to create services to explore what is possible and let people play around with it. In some cases, it even created new business models for the companies behind these innovations. I am personally still wondering how many people pay OpenAI to play around with Dall-E.
From a different angle, there were never so many of these advances made available as open-source technologies that people can download, play around with, and even build upon. OpenAI has recently open-sourced Whisper and opened access to its Dall-E 2 model to the public. The Stable Diffusion model is also available to the community and there are already several remarkable projects built on it. If you are interested in running Stable Diffusion locally, I wrote a tutorial on it, so give it a try.
In the last year, several machine learning models have become available to the public to generate images from textual descriptions. This has been an interesting development in the AI space. However, only recently did this technology become available for everyone to try.
One of the companies that have been spearheading these efforts is HuggingFace. The company provides tools that enable users to build, train, and deploy machine learning models based on open-source technologies and code. It also helps numerous parties share their models and build upon each other. An example of this is BLOOM, an open-source large language model created collaboratively by more than a thousand researchers.
This AI democratization is a unique characteristic of this new hype wave the world is experiencing, which has the potential to entirely change the outcome of how it will impact our lives for three reasons:
Use-cases are fun and everyone can try them - Unlike the self-driving prophecies or the all-knowing healthcare AIs of the 80s, these use-cases are way simpler and ubiquitous, therefore appealing to more people.
Almost everyone can try it even if they don’t understand how it works - Available through open-source software licenses or via a website, these tools can be tried by almost anyone who wants to have fun with them.
The community can build on it easily - The fact that some of these will be open to the public will exponentially increase the innovation that will happen in the space.
Ultimately, all the above reasons will contribute to making AI as a whole a more widespread and well-accepted technology, which hopefully will get us away from the pop-culture visions of movies like Terminator or The Matrix.
What Can You Do With It Today?
These models and technologies are commoditizing the ability to generate content, which was the last remaining step in the Idea Propagation Value Chain that had yet to be fundamentally disrupted by technology. The internet already entirely changed how we distribute content (the final part of the chain). Almost every file is digital, can be copied at zero cost, and sent almost instantaneously to anyone on earth. These new technologies will revolutionize the initial stages of the propagation value chain: the creation and substantiation of an idea.
Just considering the technologies I had a chance to play around with (Dall-E, Midjourney, and Stable Diffusion), the prerequisites of learning to draw, paint, or model and render 3D content completely go away. Anyone will be able to tell an artificial agent what they want to see, and it will create it for them.
As an example, I’ve used Stable Diffusion to generate the thumbnail for this article. I knew more or less what I wanted, so it just became a matter of going through a couple dozen ideas until I found something that I liked. Some examples are below.
Moreover, if you run out of ideas, and need help with designing the prompts, there are already entire sites focused on indexing and providing the best prompts with examples of what others created. Lexica and Prompthero are two examples that I’ve tried with great results.
However, images are just the beginning…
Beyond Images
I started playing around with Stable Diffusion a couple of weeks ago, and I have to admit that the news that came out since then blew my mind. As I was amazed at how easy it currently is to ask an AI to generate images, I realized that some projects are trying to go well beyond that.
It began when I came across this re-tweet from MKBHD:
— Marques Brownlee (@MKBHD) October 19, 2022
I was surprised that there were already such good results for text-to-video models and that so many companies were working on it. That week, I discovered a startup called Runway which is working on a video editor powered by all of these machine-learning innovations. A couple of days later, I saw articles about Google’s new text-to-video network, Imagen Video, and Meta’s announcement of Make-a-Video.
Soon after, I discovered all the work also happening to generate 3D models from text and flat images, and to animate 3D models based on textual descriptions.
However, the most surprising one (and also a bit off-putting due to its potential implications) was a podcast I came across of Joe Rogan interviewing Steve Jobs, created by podcast.ai. For those of you who don’t know, Joe Rogan has a widely successful podcast that has been running for years, and Steve Jobs is, well, dead. Those two men never had the chance to be in the same room together; however, and without their permission I imagine, there are 20 minutes of audio of them talking as if the conversation had happened.
While thinking about the impacts of using these technologies to emulate people who are no longer among us, I came across this article. So, not only are there some examples of people doing this with celebrities, there are also companies like DeepBrain AI which already monetize such a service and can create a digital avatar of your lost loved ones.
Potential Pitfalls
Throughout our history as a species, there were always problems and issues that had to be sorted out after a new invention came along.
Legal & Ethical
One of the potential pitfalls is the legal and ethical implications of these new AI systems and their impacts on society. For example, when generating an image using one of the text-to-image models in this article, who owns the final product? The person coming up with the prompt? The team that builds the model? The team that builds the training set? The artists whose images were in that set? All of them? None of them? None of that is sorted out at this stage, and it is already a big concern.
One of the relevant discussions happening right now on this topic concerns the copyright issues around GitHub’s Copilot product. Copilot is an AI that was trained on public code repositories available on GitHub to help developers code faster by turning comments into code, for example. How would you feel about having your code used to generate potentially millions for a private company without getting a dime for it? There’s more information here if you are interested.
Artists are also finding out how their art was used to train these models and are not happy about it. Companies and startups also need to worry about IP infringement if they are using any of these solutions, or creating them.
Finally, there is an even bigger problem when considering that this technology can be used by ill-intentioned people to generate images of people doing things they never did, or saying things they never said. This is the same issue as with Deepfakes, which already have several research initiatives addressing them, but it’s still a real concern. For what it is worth, some tools in this article did a great job of making sure you can’t generate that type of content by adding safety filters to their services. However, for all the open-source ones, anyone has the power to override those safety measures.
All of these are very valid legal concerns within the industry that should be addressed ASAP, or there is a risk that all of it can turn into a legal storm that will take us back years. More than just the legal aspects, this technology has a very real potential to destroy someone’s life; therefore it should be thought through with time and low tolerance for mistakes.
Perceived Value & Backlash
Initially, I thought that this tech would make everyone a good artist, but after playing around with it, I am not convinced that is the case anymore. What makes a good artist is more than just their raw execution ability. Factors like creativity, knowing what you actually want to create, and artistic knowledge are hugely important for having a good final product. At this stage, I think that these technologies will enable ordinary people to create something, but will give current professional artists super-powers that will enable them to take their work to another level.
Having said this, the fact that these models enable us as a society to produce more, faster, and at a lower cost, will have an impact on the perceived value of their outputs. As an example, imagine a design department at a given news media publication with around 20 people. If the current technology becomes mainstream, that same department will probably not need 20 people.
There was a story not so long ago about a journalist from The Atlantic who used Midjourney to generate images for an article and received massive backlash on Twitter. You can read his thoughts on what happened here. Given the already difficult and competitive environment in which some of these artists work, the potential pushback against these tools is understandable. There is a potential real impact on the job market. Even though it will be bad for some people in the short term, the real question is whether it will be good or bad in the long run. This phenomenon is quite common in big technological innovations and has happened several times throughout history.
Note: There is already a new area called Prompt Engineering , and others might appear soon.
Interestingly enough, legal concerns and human backlash have always been the major pitfalls for the adoption of any AI system in the past, more so than with technology in general.
What’s Next?
I think the current applications of the already existing technology will be massive, and therefore whatever prediction one can make will have a high degree of uncertainty. These technologies affect the current Idea Propagation Value Chain, specifically the parts of that chain that were never touched until now: creation and substantiation. This fact alone has the potential to impact us more than the internet, which changed the duplication and distribution parts of the chain, ever did. Those impacts alone could fill the pages of an entire book series. If you are interested in this part of the topic, I highly recommend Ben Thompson’s article on it.
With the disclaimer above, here’s what I think will happen in this space in the next 2 to 5 years.
Legal issues around ownership will increase until a good solution comes up - We already discussed some potential legal issues in this article. If those are not solved, there is a risk of derailing everything going on in the space. For the copyright ones, I think the grounds for legal action are muddy, to say the least, which might drag these discussions out for years before there is actually any real impact on innovation.
Dramatic increase in funding for companies working on these problems - Hype usually means FOMO, which means more money for whoever wants to solve problems in the space (yes, even in the current macroeconomic situation). We are already seeing the early signals around this, with some companies raising some of the biggest seed rounds in history:
Jasper, the creator of a content platform for marketers, raised $125M on a $1.5B valuation
The tech will start being productized as features in existing products - Some of this tech has the potential to go into image and video editing software today. Companies like Runway are already creating brand-new products with this tech at their core. Incumbent companies like Adobe already started to include these tools in their software, e.g., Dall-E straight into Adobe Creative Cloud.
All of these areas will start to merge with cohesive results - I expect to see something happening around this in the next 12 to 18 months. At least some kind of PoC that will merge a minimum of 2 of these areas into something new, e.g., video + audio, or 3D + animation, etc.
Games, VR, and the Metaverse - I feel like the biggest potential for this technology is how much it can accelerate content creation (once quality is constant, which is still not the case). Games and 3D content are where I see the biggest problem that these models could solve. Think about the amount of time, resources, and money spent to create characters for a game, including conceptualizing, modeling, rigging, animating, etc. AI tools could make the creation of these huge game worlds more effective and efficient.
While we wait to know what will happen across this exciting space, I will keep researching and playing around with these technologies as much as I can. What will you create with these systems? What do you think the impacts of deploying them at scale are? Reach out to me and let me know.
Note: Meanwhile, I’ve created an Instagram account to share my Stable Diffusion creations with the web.