Jordi Inglada


17 Apr 2023

Use ChatGPT and feed the technofeudal beast

tl;dr1: Big Tech is deploying AI assistants in their cloud workspaces to help you work better, but are you sure that what you produce using these tools falls under your copyright and not that of your AI provider?

Intro

Unless you have been living under a rock for the last few months, you have heard about ChatGPT and you have probably played with it. You may even be among those who have been using it for somewhat serious matters, like writing code, doing your homework, or writing reports for your boss.

There are many controversies around the use of these Large Language Models (LLMs; ChatGPT is not the only one). Some people are worried about white-collar workers being replaced by LLMs as blue-collar ones were by mechanical robots. Other people are concerned by the fact that these LLMs have been trained on copyrighted data. And yet another set of people may be worried about the use of LLMs for spreading fake information at a massive scale.

All these concerns are legitimate2, but there is another one that I don't see addressed and that, from my point of view, is at least as important as those above: the fact that these LLMs are the ultimate stage of proprietary software and user lock-in. The final implementation of technofeudalism.

Yes, I know that ChatGPT has been developed by OpenAI (yes, open) and they just want to do good for humankind:

OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

But remember, ChatGPT is not open source and it is not a common good or a public service, as opposed to Wikipedia, for instance.

The evolution of user lock-in in software

Proprietary software is software that comes with restrictions on what the user can do with it.

A long time ago, software used to come in a box, on floppy disks, cartridges or CD-ROMs. The user put the disk in the computer and installed it. There was a time when, even if the license didn't allow it, a user could lend the disk to friends so they could install it on their own computers. At that time, the only lock-in the vendor could enforce was that of proprietary formats: the text editor you bought would store your writing in a format that only that software could read.

In order to prevent users from sharing a single paid copy of the software, vendors implemented artificial scarcity: a registration procedure. For the software to work, you needed to enter a code to unlock it. With the advent of the internet, the registration could be done online, and therefore only one registration per physical box was possible. This was a second lock-in. Of course this was not enough, and vendors implemented artificial obsolescence: expiring registrations. The software would stop working after a predefined period of time and a new registration was needed.

Up to this point, clever users always found a way to unlock the software, because software can be reverse-engineered. In Newspeak, this is called pirating.

The next step was to move software to the cloud (i.e. the vendor's computers). Now, you didn't install the software on your computer; you just connected to the vendor's service and worked in your web browser. This allowed the vendor to terminate your license at any time and even to implement pay-per-use plans. On top of that, your data was hosted on their servers. This is great, because you always have access to the most recent version of the software and your data is never lost if your computer breaks. This is also great for the vendor, because they can see what you do with their software and have a peek at your data.

This software in the cloud has also allowed the development of so-called collaborative work: several people editing the same document, for instance. This is great for you, because you have the feeling of being productive (don't worry, it's just a feeling). This is great for the vendor, because they see how you work. Remember, you are playing a video game on their computer, and they store every action you take3, every interaction with your collaborators.

Now, with LLMs, you are in heaven: you get suggestions on how to write things, you can ask questions to the assistant, and you are much more productive. This is also great for the vendor, because now they not only know what you write and how you edit and improve your writing, but they also know what and how you think. Better still, your interactions with the LLM are used to improve it. This is a win-win situation, right?

How do LLMs work?

The short answer is that nobody really knows. This is not because there is some secret, but rather because these models are so complex that AI researchers themselves don't have the mathematical tools to explain why they work. We can, however, still try to get a grasp of how they are implemented. You will find many resources online. Here I just show off a bit. You can skip this section.

LLMs are a kind of artificial neural network, that is, a machine learning system that is trained with data to accomplish a set of tasks. LLMs are trained with textual data, lots of textual data: the whole web (Wikipedia, forums, news sites, social media), all the books ever written, etc. The task they are trained to perform is simply predicting the next words given a text fragment. You give the beginning of a sentence to the system and it has to produce the rest of it. You give it a couple of sentences and it has to finish the paragraph. You see the point.
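
To make this concrete, here is a toy illustration in Python of how a sentence becomes next-word prediction examples (the naive word split is my simplification; real systems use subword tokens and much longer contexts):

    # A toy illustration: turning raw text into next-word prediction pairs.
    # Real LLMs use subword tokens and contexts of thousands of tokens.
    text = "the cat sat on the mat"
    words = text.split()

    # Each training example: all the words so far -> the word that follows.
    pairs = [(words[:i], words[i]) for i in range(1, len(words))]
    for context, target in pairs:
        print(" ".join(context), "->", target)
    # the -> cat
    # the cat -> sat
    # the cat sat -> on
    # ...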

What does "training a system" mean? Well, this kind of machine learning algorithm is just a huge mathematical function which implements millions of simple operations of the kind \(output = a \times input + b\). All these operations are plugged together, so that the output of the first operation becomes the input of the second, and so on. In the case of an LLM, the very first input is the beginning of the sentence that the system has to complete. The very last output of the system is the "prediction" of the end of the sentence. You may wonder how you can add or multiply sentences. Well, words and sentences are just transformed into numbers with some kind of dictionary, so that the system can do its operations. So, to recap:

  1. you take text and transform it into sequences of numbers;
  2. you take these numbers and perform millions of operations like \(a\times input + b\) and combine them together;
  3. you transform the output numbers back into text to get the LLM's answer (see the sketch below).
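
Here is a deliberately silly Python sketch of these three steps, with a single \(a \times input + b\) operation standing in for the millions of real ones (the dictionary and the parameter values are made up for illustration):

    # Toy "LLM" with a five-word vocabulary and a single linear operation.
    # Everything here (dictionary, parameters) is invented for illustration.
    word_to_num = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
    num_to_word = {n: w for w, n in word_to_num.items()}

    a, b = 1, 1  # the "trained" parameters of our one-operation model

    def predict_next(word):
        x = word_to_num[word]                 # 1. text -> number
        y = (a * x + b) % len(word_to_num)    # 2. arithmetic on numbers
        return num_to_word[y]                 # 3. number -> text

    print(predict_next("cat"))  # -> "sat"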

Training the model means finding the millions of \(a\) and \(b\) values (the parameters of the model) that make these operations produce sensible completions. If you don't train the model, the output is just rubbish. If you have enough time and computers, you can find a good set of parameters. Of course, there is a little bit of math behind all this, but you get the idea.
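
That search for good parameters can be caricatured in a few lines of Python: start with rubbish parameters, measure the error, and nudge \(a\) and \(b\) until the outputs match the data (the data and learning rate below are invented; real models have billions of parameters and trillions of training examples, but the principle is the same):

    # Caricature of training: fit output = a * input + b by gradient descent.
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # pairs where output = 2x + 1

    a, b = 0.0, 0.0  # untrained parameters: the output is rubbish
    lr = 0.1         # how much we nudge the parameters at each step

    for step in range(1000):
        for x, target in data:
            output = a * x + b
            error = output - target
            # Nudge a and b in the direction that reduces the squared error.
            a -= lr * error * x
            b -= lr * error

    print(a, b)  # close to 2.0 and 1.0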

Once the model is trained, you can give it a text that it has never encountered (a prompt) and it will produce a plausible completion (the prediction). The impressive thing about these algorithms is that they do not memorize the texts used for training. They don't check a database of texts to generate predictions. They just apply a set of arithmetic operations that work on any input text. The parameters of the model capture some kind of high-level knowledge about the structure of the language.
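
You can play with this at home using a small open model. A minimal sketch with the Hugging Face transformers library and GPT-2 (not ChatGPT, but the same family of techniques; requires pip install transformers torch):

    # Prompt completion with a small, freely available model (GPT-2).
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    result = generator("Proprietary software is", max_new_tokens=20)
    print(result[0]["generated_text"])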

Well, actually, we don't know whether LLMs memorize their training data or not. LLMs are too big and complex to be analyzed in order to check that. And this is one of the issues related to copyright infringement that some critics have raised. I am not interested in copyright here, but this impossibility of verifying that training data can't appear in the output of an LLM will be of crucial importance later on.

Systems like ChatGPT are LLMs that have been fine-tuned to be conversational. The tuning has been done by humans preparing questions and answers so that the model can be improved. The very final step is RLHF: Reinforcement Learning from Human Feedback. Once the model has been trained and fine-tuned, it can be used and its answers can be rated by human operators, so that the model doesn't spit out racist comments, sexist texts or anti-capitalist ideas. OpenAI, who, remember, just want to do good for humankind, abused underpaid operators for this task.
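
For the curious, the core trick behind RLHF can be sketched as training a reward model so that the answer humans preferred gets a higher score than the rejected one, using a standard preference loss. In this sketch the reward function is a placeholder I invented so the snippet runs on its own; a real reward model is itself a large neural network, and this is nobody's actual code:

    # Minimal sketch of the reward-model objective used in RLHF.
    import math

    def reward(answer: str) -> float:
        # Placeholder standing in for a neural network that scores answers.
        return float(len(answer))

    def preference_loss(chosen: str, rejected: str) -> float:
        # Push the chosen answer's reward above the rejected one's.
        # Minimizing log(1 + exp(-margin)) trains the reward model.
        margin = reward(chosen) - reward(rejected)
        return math.log(1.0 + math.exp(-margin))

    print(preference_loss("a long, polite answer", "rude"))  # small loss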

OK. Now you are an expert on LLMs and you are not afraid of them. Perfect. Let's move on.

Technofeudalism

If producing an LLM is not so complex, why aren't we all training our own models? The reason is that training a model like GPT-3 (the engine behind ChatGPT) takes about 355 years on a single NVIDIA Tesla V100 GPU4, so if you want to do it in something like a few weeks you need a lot of machines and a lot of electricity. And who has the money to do that? The big tech companies.
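
A back-of-the-envelope calculation makes the point (the 355-year figure is the one above; the one-month target is my assumption):

    # How many GPUs does it take to compress 355 GPU-years into one month?
    gpu_years = 355
    target_months = 1

    gpus_needed = gpu_years * 12 / target_months
    print(f"{gpus_needed:.0f} GPUs running in parallel")  # 4260 GPUs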

You may say that OpenAI is a non-profit, but Wikipedia reminds us that Microsoft provided OpenAI with a $1 billion investment in 2019 and a second multi-year investment in January 2023, reported to be $10 billion. This is how LLMs are now available in GitHub Copilot (owned by Microsoft), Bing and Office 365.

Of course, Google is doing the same and Meta may also be doing their own thing. So we have nothing to fear, since we have the choice. This is a free market.

Or maybe not. Feudal Europe was not really a free market for serfs. When only a handful of private actors have the means to build this kind of tool, we become dependent on their will and their interests. There is also something morally wrong here: LLMs like ChatGPT have been trained on digital commons like Wikipedia and open source software, on all the books ever written, on the creations of individuals. Now, these tech giants have put fences around all that. Of course, we still have Wikipedia and all the rest, but now we are being told that we have to use these tools to be productive. Universities are changing their courses to teach programming with ChatGPT. Orange is the new black.

Or as Cédric Durand explains it:

The relationship between digital multinational corporations and the population is akin to that of feudal lords and serfs, and their productive behavior is more akin to feudal predation than capitalist competition.

[…]

Platforms are becoming fiefs not only because they thrive on their digital territory populated by data, but also because they exercise a power lock on services that, precisely because they are derived from social power, are now deemed indispensable.5

When your thoughts are not yours anymore

I said above that I am not interested in copyright, but you might be. Say you use an LLM or a similar technology to generate texts. You can ask ChatGPT to write a poem or the lyrics of a song. You just give it a subject and some context, and there it goes. This is lots of fun. The question now is: who is the author of the song? It's clearly not you. You didn't create it. If ChatGPT had some empathy, it could grant you some kind of co-authorship, and you should be happy with that. You could even build a kind of Lennon-McCartney joint venture. You know, many of the Beatles' songs were written by either John or Paul alone, but they shared the authorship. They were good chaps. Until they weren't anymore.

You may have used DALL-E or Stable Diffusion. These are AI models which generate images from a textual description. Many people are using them (they are accessible online) to produce logos, illustrations, etc. The produced images contain a digital watermark: a kind of code which is inserted in the image and which is invisible to the eye. This makes it possible to track authorship, which can be useful to prevent deep fakes. But it can also be used to ask you for royalties if the image ever generates revenue.
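
To get an intuition of how an invisible watermark can work, here is a toy version of the classic least-significant-bit trick; real schemes are far more robust (they survive compression and cropping), and this is not any vendor's actual method:

    # Toy invisible watermark: hide one secret bit per pixel in the least
    # significant bit of an 8-bit grayscale image.
    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)
    mark = rng.integers(0, 2, size=(4, 4), dtype=np.uint8)  # secret bits

    watermarked = (image & 0xFE) | mark  # overwrite the lowest bit
    recovered = watermarked & 1          # extract the hidden bits

    assert np.array_equal(recovered, mark)
    # Each pixel changes by at most 1 out of 255: invisible to the eye.
    print(np.abs(watermarked.astype(int) - image.astype(int)).max())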

Watermarking text seems impossible. At the end of the day, a text is just a sequence of characters, so you can't hide a secret code in it, right? Well, wrong. You can generate text with specific statistical patterns in it. But, OK, the technology does not seem mature and seems easy to break with just small modifications. You may therefore think that there is no risk of Microsoft or Google suing you for using their copyrighted material. Well, the sad thing is that you produced your song using their software on their computers (remember, SaaS, the cloud), so they have the movie showing what you did.
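
That statistical-pattern idea can itself be sketched: bias the generator towards a secret, pseudo-random subset of words, then detect the watermark by counting how often those words appear. This is a toy version of published "green list" schemes; the partition and the detection logic here are invented for illustration:

    # Toy statistical text watermark: the generator secretly prefers words
    # whose hash is even (the "green list"). A detector that knows the
    # secret counts green words; an unusually high fraction flags the text.
    import hashlib

    def is_green(word: str) -> bool:
        digest = hashlib.sha256(word.encode()).digest()
        return digest[0] % 2 == 0  # secret partition of the vocabulary

    def green_fraction(text: str) -> float:
        words = text.split()
        return sum(is_green(w) for w in words) / len(words)

    # Ordinary text: about half the words are green by chance.
    # Watermarked text (generator biased towards green words): much more.
    print(green_fraction("the cat sat on the mat"))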

This may not be a problem if you are just writing a song for a party or producing an illustration for a meme. But many companies will soon have employees (programmers, engineers, lawyers, marketers, writers, journalists) producing content with these tools on these platforms. Are they willing to share their profits with Microsoft? Did I tell you about technofeudalism?

Working for the man

"Was this response better or worse?" This is what ChatGPT asks you if you tell it to regenerate a response. You just have to check one box: better, worse, or the same. You did it? Great! You are an RLHFer6. This is why ChatGPT is free. Of course, it is free for the good of humankind, but this way you can help it get even better! Thanks for your help. By the way, GitHub Copilot will also get better. Bing will get better too. And Office 365 too.

Thanks for helping us feed the beast.

Cheers.

Footnotes:

1

Abstract for those who have an attention deficit as a consequence of consuming content on Twitter, Instagram and the like.

2

Well, at least if you believe in intellectual property, or think that bullshit jobs are useful or that politicians and companies tell the truth. One has to believe in something, I agree.

3

Yes, they do. This is needed to be able to revert changes in documents.

5

"My" translation. You have an interesting interview in English here https://nymag.com/intelligencer/2022/10/what-is-technofeudalism.html.

6

Reinforcement Learning from Human Feedback. I explained that above!

Tags: open-source free-software programming surveillance-capitalism llm
jordiinglada.net by Jordi Inglada is licensed under a Creative Commons Attribution-ShareAlike 4.0 Unported License.