Posts tagged "open-source":

17 Apr 2023

Use ChatGPT and feed the technofeudal beast

tl;dr¹: Big Tech is deploying AI assistants on their cloud work-spaces to help you work better, but are you sure that what you produce using these tools falls under your copyright and not that of your AI provider?

Intro

Unless you have been living under a rock for the last few months, you have heard about ChatGPT and you have probably played with it. You may even be among those who have been using it for somewhat serious matters, like writing code, doing your homework, or writing reports for your boss.

There are many controversies around the use of these Large Language Models (LLM, ChatGPT is not the only one). Some people are worried about white collar workers being replaced by LLM as blue collar ones were by mechanical robots. Other people are concerned with the fact that these LLM have been trained on data under copyright. And yet another set of people may be worried about the use of LLM for spreading fake information on a massive scale.

All these concerns are legitimate², but there is another one that I don't see addressed and that, from my point of view, is at least as important as those above: the fact that these LLM are the ultimate stage of proprietary software and user lock-in. The final implementation of technofeudalism.

Yes, I know that ChatGPT has been developed by OpenAI (yes open) and they just want to do good for humankind:

OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

But remember, ChatGPT is not open source and it is not a common good or a public service, as opposed to Wikipedia, for instance.

The evolution of user lock-in in software

Proprietary software is software that comes with restrictions on what the user can do with it.

A long time ago, software used to come in a box, on floppy disks, cartridges or CD-ROM. The user put the disk in the computer and installed it. There was a time when, even if the license didn't allow it, a user could lend the disk to friends so they could install it on their own computers. At that time, the only lock-in that the vendor could enforce was that of proprietary formats: the text editor you bought would store your writing in a format that only that software could read.

In order to prevent users from sharing a single paid copy of the software, vendors implemented artificial scarcity: a registration procedure. For the software to work, you needed to enter a code to unlock it. With the advent of the internet, the registration could be done online, and therefore just one registration per physical box was possible. This was a second lock-in. Of course this was not enough, and vendors implemented artificial obsolescence: expiring registrations. The software could stop working after a predefined period of time and a new registration was needed.

Up to this point, clever users always found a way to unlock the software, because software can be reverse-engineered. In Newspeak, this is called pirating.

The next step was to move software to the cloud (i.e. the vendor's computers). Now, you didn't install the software on your computer, you just connected to the vendor's service and worked in your web browser. This allowed the vendor to terminate your license at any time and even to implement pay-per-use plans. Added to that, your data was hosted on their servers. This is great, because you always have access to the most recent version of the software and your data is never lost if your computer breaks. This is also great for the vendor, because they can see what you do with their software and have a peek at your data.

This software in the cloud has also allowed the development of so-called collaborative work: several people editing the same document, for instance. This is great for you, because you have the feeling of being productive (don't worry, it's just a feeling). This is great for the vendor, because they see how you work. Remember, you are playing a video game on their computer and they store every action you take³, every interaction with your collaborators.

Now, with LLM, you are in heaven, because you get suggestions on how to write things, you can ask the assistant questions and you are much more productive. This is also great for the vendor, because they now not only know what you write and how you edit and improve your writing, but they also know what and how you think. Better still, your interactions with the LLM are used to improve it. This is a win-win situation, right?

How do LLM work?

The short answer is that nobody really knows. This is not because there is some secret, but rather because they are complex and AI researchers themselves don't have the mathematical tools to explain why they work. We can however still try to get a grasp of how they are implemented. You will find many resources online. Here I just show off a bit. You can skip this section.

LLM are a kind of artificial neural network, that is, a machine learning system that is trained with data to accomplish a set of tasks. LLM are trained with textual data, lots of textual data: the whole web (Wikipedia, forums, news sites, social media), all the books ever written, etc. The task they are trained to perform is just to predict the next set of words given a text fragment. You give the beginning of a sentence to the system and it has to produce the rest of it. You give it a couple of sentences and it has to finish the paragraph. You see the point.

What does "training a system" mean? Well, this kind of machine learning algorithm is just a huge mathematical function which implements millions of simple operations of the kind $output = a \times input + b$. All these operations are plugged together, so that the output of the first operation becomes the input of the second, etc. In the case of an LLM, the very first input is the beginning of the sentence that the system has to complete. The very last output of the system is the "prediction" of the end of the sentence. You may wonder how you can add or multiply sentences. Well, words and sentences are just transformed into numbers with some kind of dictionary, so that the system can do its operations. So to recap:

you take text and transform it to sequences of numbers;
you take these numbers and perform millions of operations like $a\times input + b$ and combine them together;
you transform back the output numbers into text to get the answer of the LLM.
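
The three steps above can be sketched in a few lines of C++. This is only a toy illustration, not how a real LLM is built: the three-word vocabulary, the `Affine` struct, the function names and the parameter values are all made up for the example, and a real model chains vastly more operations, with nonlinear steps between them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One elementary operation: output = a * input + b.
struct Affine {
  double a;
  double b;
};

// Toy "dictionary" (hypothetical): a word is represented by its index.
const std::vector<std::string> kVocabulary = {"the", "cat", "sleeps"};

// Step 1: text -> number. Returns -1 for a word that is not in the dictionary.
int encode(const std::string& word) {
  for (std::size_t i = 0; i < kVocabulary.size(); ++i) {
    if (kVocabulary[i] == word) return static_cast<int>(i);
  }
  return -1;
}

// Step 2: chain the operations; each output is the input of the next one.
double apply_chain(const std::vector<Affine>& ops, double input) {
  double value = input;
  for (const Affine& op : ops) {
    value = op.a * value + op.b;
  }
  return value;
}

// Step 3: number -> text, clamping and rounding to the nearest valid index.
std::string decode(double value) {
  assert(!kVocabulary.empty());
  const double last = static_cast<double>(kVocabulary.size() - 1);
  if (value < 0.0) value = 0.0;
  if (value > last) value = last;
  return kVocabulary[static_cast<std::size_t>(value + 0.5)];
}

// The whole "model": encode, compute, decode.
std::string predict_next(const std::vector<Affine>& ops,
                         const std::string& word) {
  return decode(apply_chain(ops, encode(word)));
}
```

With the hand-picked parameters `{{2.0, 0.0}, {0.5, 1.0}}`, the chain computes input + 1, so `predict_next` maps "the" to "cat" and "cat" to "sleeps". Nothing here was learned: the two pairs of numbers were chosen by hand so that the output makes sense.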

Training the model means finding the millions of $a$ and $b$ (the parameters of the model) that perform the operations. If you don't train the model, the output is just rubbish. If you have enough time and computers, you can find a good set of parameters. Of course, there is a little bit of math behind all this, but you get the idea.

Once the model is trained, you can give it a text that it has never encountered (a prompt) and it will produce a plausible completion (the prediction). The impressive thing with these algorithms is that they do not memorize the texts used for training. They don't check a database of texts to generate predictions. They just apply a set of arithmetic operations that work on any input text. The parameters of the model capture some kind of high level knowledge about the language structure.

Well, actually, we don't know whether LLM memorize the training data or not. LLM are too big and complex to be analyzed to check that. And this is one of the issues related to copyright infringement that some critics have raised. I am not interested in copyright here, but this impossibility of verifying that training data can't appear at the output of an LLM will be of crucial importance later on.

Systems like ChatGPT are LLM that have been fine-tuned to be conversational. The tuning has been done by humans preparing questions and answers so that the model can be improved. The very final step is RLHF: Reinforcement Learning from Human Feedback. Once the model has been trained and fine-tuned, it can be used and its answers can be rated by operators in order to prevent the model from spitting out racist comments, sexist texts or anti-capitalist ideas. OpenAI, who, remember, just want to do good for humankind, abused underpaid operators for this task.

OK. Now you are an expert on LLM and you are not afraid of them. Perfect. Let's move on.

Technofeudalism

If producing an LLM is not so complex, why aren't we all training our own models? The reason is that training a model like GPT-3 (the engine behind ChatGPT) needs about 355 years on a single NVIDIA Tesla V100 GPU⁴, so if you want to do this in something like several weeks you need a lot of machines and a lot of electricity. And who has the money to do that? The big tech companies.
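
To get a feel for the scale, here is the back-of-the-envelope arithmetic as a small C++ function. It is a sketch under generous assumptions: perfect linear scaling across GPUs (real clusters do worse) and 365-day years. The function name is made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Number of GPUs needed to finish a job in `target_days`, given its cost in
// single-GPU years. Assumes perfect linear scaling, which is optimistic.
long gpus_needed(double single_gpu_years, double target_days) {
  assert(single_gpu_years >= 0.0 && target_days > 0.0);
  const double gpu_days = single_gpu_years * 365.0;
  return static_cast<long>(std::ceil(gpu_days / target_days));
}
```

With the 355 GPU-years quoted above, `gpus_needed(355.0, 28.0)` gives 4628: several thousand V100 cards running flat out for four weeks, before counting any failed or exploratory runs.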

You may say that OpenAI is a non-profit, but Wikipedia reminds us that Microsoft provided OpenAI with a $1 billion investment in 2019 and a second multi-year investment in January 2023, reported to be $10 billion. This is how LLM are now available in GitHub Copilot (owned by Microsoft), Bing and Office 365.

Of course, Google is doing the same and Meta may also be doing their own thing. So we have nothing to fear, since we have the choice. This is a free market.

Or maybe not. Feudal Europe was not really a free market for serfs. When only a handful of private actors have the means to build this kind of tool, we become dependent on their will and interests. There is also something morally wrong here: LLM like ChatGPT have been trained on digital commons like Wikipedia or open source software, on all the books ever written, on the creations of individuals. Now, these tech giants have put fences around all that. Of course, we still have Wikipedia and all the rest, but now we are being told that we have to use these tools to be productive. Universities are changing their courses to teach programming with ChatGPT. Orange is the new black.

Or as Cédric Durand explains it:

The relationship between digital multinational corporations and the population is akin to that of feudal lords and serfs, and their productive behavior is more akin to feudal predation than capitalist competition.

[…]

Platforms are becoming fiefs not only because they thrive on their digital territory populated by data, but also because they exercise a power lock on services that, precisely because they are derived from social power, are now deemed indispensable.⁵

When your thoughts are not yours anymore

I said above that I am not interested in copyright, but you might be. Say that you use an LLM or a similar technology to generate texts. You can ask ChatGPT to write a poem or the lyrics of a song. You just give it a subject, some context and off it goes. This is lots of fun. The question now is: who is the author of the song? It's clearly not you. You didn't create it. If ChatGPT had some empathy, it could give you some kind of co-authorship, and you should be happy with that. You could even build a kind of Lennon-McCartney joint venture. You know, many of the Beatles' songs were written by either John or Paul, but they shared the authorship. They were good chaps. Until they weren't anymore.

You may have used Dall-E or Stable Diffusion. These are AI models which generate images from a textual description. Many people are using them (they are accessible online) to produce logos, illustrations, etc. The produced images contain a digital watermark. This is a kind of code which is inserted in the image and is invisible to the eye. It makes it possible to track authorship. This can be useful to prevent deep fakes. But it can also be used to ask you for royalties if the image ever generates revenue.

Watermarking texts seems impossible. At the end of the day, a text is just a sequence of characters, so you can't hide a secret code in it, right? Well, wrong. You can generate a text with some statistical patterns. But, OK, the technology does not seem mature and seems easy to break with just small modifications. You may therefore think that there is no risk of Microsoft or Google suing you for using their copyrighted material. Well, the sad thing is that you produced your song using their software on their computers (remember, SaaS, the cloud), so they have the movie showing what you did.

This may not be a problem if you are just writing a song for a party or producing an illustration for a meme. But many companies will soon start having employees (programmers, engineers, lawyers, marketers, writers, journalists) producing content with these tools on these platforms. Are they willing to share their profits with Microsoft? Did I tell you about technofeudalism?

Working for the man

Was this response better or worse? This is what ChatGPT asks you if you tell it to regenerate a response. You just have to check one box: better, worse, same. You did it? Great! You are an RLHFer⁶. This is why ChatGPT is free. Of course, it is free for the good of humankind, but you can help it to get even better! Thanks for your help. By the way, GitHub Copilot will also get better. Bing will get better too. And so will Office 365.

Thanks for helping us feed the beast.

Cheers.

Footnotes:

¹ Abstract for those who have attention deficit as a consequence of consuming content on Twitter, Instagram and the like.

² Well, at least if you believe in intellectual property, or think that bullshit jobs are useful or that politicians and companies tell the truth. One has to believe in something, I agree.

³ Yes, they do. This is needed to be able to revert changes in documents.

⁴ https://www.reddit.com/r/GPT3/comments/p1xf10/how_many_days_did_it_take_to_train_gpt3_is/

⁵ "My" translation. You have an interesting interview in English here https://nymag.com/intelligencer/2022/10/what-is-technofeudalism.html.

⁶ Reinforcement Learning from Human Feedback. I explained that above!

15 Apr 2023

IT modernization, cloud office suites and the protection of trade secrets

The IT departments (DSI¹) of medium and large companies are tired of having to maintain and update hundreds or thousands of workstations when IT is not the company's core business². On top of that, competent IT people with good personal hygiene are hard to find.

The answer is obviously to turn to online "solutions": SaaS (Software as a Service), the cloud. It's simple: the user's workstation can be reduced to a terminal with a web browser that connects to the servers of Microsoft or Google (the 2 main providers). No more local installation of the Office suite, no more updates or version upgrades, no more backups. The provider takes care of everything: the software runs on its machines and the documents produced are hosted there too.

It's simple, it's in the cloud: it's IT modernization!

However, some companies are too fussy about intellectual property, digital sovereignty and even the PPST (the French rules protecting the nation's scientific and technical assets) or the GDPR. They could therefore be reluctant to adopt this kind of solution. Indeed, it can be tricky to store confidential information on servers that you do not control. But how do you stay competitive against those who are disruptive, move fast and don't let themselves be bothered by the Amish?

According to the providers, there is nothing to fear, because the solutions on offer let you enable client-side encryption. This means that the data flowing between the user's workstation and the provider's servers is encrypted with a key that only the customer holds. As a consequence, the documents stored at the provider's are unreadable to the provider, since it does not have the key.

So we are reassured. This is no doubt why Airbus uses Google Workspace³ and Thales uses Microsoft 365. Besides, Thales is Microsoft's partner for the development of encryption techniques. Cocorico!

As a result, the French State's calls for tenders for sovereign office suites seem pointless. Especially since Franco-American "trusted cloud" partnerships are about to see the light of day. But some questions remain. For instance, here are a few points to watch:

Client-side encryption is not reliable. I come back to this below.
The extraterritorial reach of US law can force American providers to carry out "espionage adorned with the virtues of legality".
The provider can cut off the customer's access at any time: no more e-mail, no more Excel spreadsheets, no more ChatGPT helping us write reports that nobody reads.

Why is client-side encryption not reliable? To begin with, this encryption is not complete: a lot of information remains readable in the clear by the provider. For example, document names, e-mail headers, lists of meeting participants, etc. Video and audio streams are supposedly encrypted, but even Microsoft's official encryption partner writes this in its invitations to video conferences on Teams:

Reminder: No content above "THALES GROUP LIMITED DISTRIBUTION” / “THALES GROUP INTERNAL" and no country eyes information can be discussed/presented on Teams.

A short explanation of "country eyes" can be found here⁴.

But there is another, more interesting aspect of client-side encryption. This encryption is performed by the provider's software, for which the customer does not have the source code. So there is no guarantee that the encryption is done at all, or that it is done properly. One could object that software can be audited: the provider lets you look at the code, or lets you study the network traffic, so that you feel reassured. The answer to that is: dieselgate. In other words, there is no guarantee that the software behaves the same way during the audit as the rest of the time. Moreover, since the provider manages the software updates, there is no guarantee that the audited version is the one actually in use.

It would therefore seem that the real "solution" is to host your own data and software, and for the source code of that software to be available. The suzerain cloud⁵ is not inevitable.

Footnotes:

¹ Direction des Systèmes d'Information (DSI): the IT department.

² There are also companies whose core business is IT and which do what I describe below.

³ We have come a long way since Airbus filed a complaint about industrial espionage by the USA. Since Snowden, we know that the NSA drinks straight from the GAFAM tap.

⁴ "A surveillance alliance is an agreement between multiple countries to share their intelligence with one another. This includes things like the browsing history of any user that is of interest to one of the member countries. Five eyes and Fourteen eyes are two alliances of 5 and 14 countries respectively that agree to share information with one another wherever it is mutually beneficial. They were originally made during the cold war as a means of working together to overcome a common enemy (the soviet union) but still exist today." https://www.securitymadesimple.org/cybersecurity-blog/fourteen-eyes-surveillance-explained

⁵ No, that is not a typo.

27 May 2015

Installing OTB has never been so easy

You've heard about the Orfeo Toolbox library and its wonders, but urban legends say that it is difficult to install. Don't believe that. Maybe it was difficult to install, but this is not the case anymore.

Thanks to the heroic work of the OTB core development team, installing OTB has never been so easy. In this post, you will find the step-by-step procedure to compile OTB from source on a Debian 8.0 Jessie GNU/Linux distribution.

Prepare the user account

I assume that you have a fresh install. The procedure below has been tested in a virtual machine running under VirtualBox. The virtual machine was installed from scratch using the official netinst ISO image.

During the installation, I created a user named otb that I will use for the tutorial below. For simplicity, I give this user root privileges in order to install some packages. This can be done as follows. Log in as root or use the command:

su -

You can then edit the /etc/sudoers file by using the following command:

visudo

This will open the file with the nano text editor. Scroll down to the lines containing

# User privilege specification
root    ALL=(ALL:ALL) ALL

Copy the second line, paste it just below, and replace root with otb:

otb     ALL=(ALL:ALL) ALL

Write the file and quit by pressing Ctrl+O, ENTER, Ctrl+X. Log out and log in as otb. You are set!

System dependencies

Now, let's install some packages needed to compile OTB. Open a terminal and use aptitude to install what we need:

sudo aptitude install mercurial \
     cmake-curses-gui build-essential \
     qt4-dev-tools libqt4-core \
     libqt4-dev libboost1.55-dev \
     zlib1g-dev libopencv-dev curl \
     libcurl4-openssl-dev swig \
     libpython-dev

Get OTB source code

We will install OTB in its own directory. So from your $HOME directory create a directory named OTB and go into it:

mkdir OTB
cd OTB

Now, get the OTB sources by cloning the repository (depending on your network speed, this may take several minutes):

hg clone http://hg.orfeo-toolbox.org/OTB

This will create a directory named OTB (so in my case, this is /home/otb/OTB/OTB).

Using mercurial commands, you can choose a particular version or you can go bleeding edge. You will at least need the first release candidate for OTB-5.0, which you can get with the following commands:

cd OTB
hg update 5.0.0-rc1
cd ../

Get OTB dependencies

OTB's SuperBuild is a procedure which deals with all external libraries needed by OTB which may not be available through your Linux package manager. It is able to download source code, configure and install many external libraries automatically.

Since the download process may fail due to servers which are not maintained by the OTB team, a big tarball has been prepared for you. From the $HOME/OTB directory, do the following:

wget https://www.orfeo-toolbox.org/packages/SuperBuild-archives.tar.bz2
tar xvjf SuperBuild-archives.tar.bz2

The download step can be looooong. Be patient. Go jogging or something.

Compile OTB

Once you have downloaded and extracted the external dependencies, you can start compiling OTB. From the $HOME/OTB directory, create the directory where OTB will be built:

mkdir -p SuperBuild/OTB

At the end of the compilation, the $HOME/OTB/SuperBuild/ directory will contain a classical bin/, lib/, include/ and share/ directory tree. The $HOME/OTB/SuperBuild/OTB/ directory is where the configuration and compilation of OTB and all the dependencies will be stored.

Go into this directory:

cd SuperBuild/OTB

Now we can configure OTB using the cmake tool. Since you are on a recent GNU/Linux distribution, you can tell the compiler to use the most recent C++ standard, which can give you some benefits even if OTB still does not use it. We will also compile using the Release option (optimisations). The Python wrapping will be useful with the OTB Applications. We also tell cmake where the external dependencies are. The options chosen below for OpenJPEG make OTB use the gdal implementation.

cmake \
    -DCMAKE_CXX_FLAGS:STRING=-std=c++14 \
    -DOTB_WRAP_PYTHON:BOOL=ON \
    -DCMAKE_BUILD_TYPE:STRING=Release \
    -DCMAKE_INSTALL_PREFIX:PATH=/home/otb/OTB/SuperBuild/ \
    -DDOWNLOAD_LOCATION:PATH=/home/otb/OTB/SuperBuild-archives/ \
    -DOTB_USE_OPENJPEG:BOOL=ON \
    -DUSE_SYSTEM_OPENJPEG:BOOL=OFF \
    ../../OTB/SuperBuild/

After the configuration, you should be able to compile. I have 4 cores in my machine, so I use the -j4 option for make. Adjust the value to your configuration:

make -j4

This will take some time since there are many libraries which are going to be built. Time for a marathon.

Test your installation

Everything should be compiled and available now. You can set up some environment variables for an easier use of OTB. You can for instance add the following lines at the end of $HOME/.bashrc:

export OTB_HOME=${HOME}/OTB/SuperBuild
export PATH=${OTB_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${OTB_HOME}/lib

You can now open a new terminal for this to take effect or use:

cd
source .bashrc

You should now be able to use the OTB Applications. For instance, the command:

otbcli_BandMath

should display the documentation for the BandMath application.

Another way to run the applications is to use the command line application launcher as follows:

otbApplicationLauncherCommandLine BandMath $OTB_HOME/lib/otb/applications/

Conclusion

The SuperBuild procedure makes it easy to install OTB without having to deal with different combinations of versions for the external dependencies (TIFF, GEOTIFF, OSSIM, GDAL, ITK, etc.).

This means that once you have cmake and a compiler, you are pretty much set. Qt4 and Python are optional things which will be useful for the applications, but they are not required for a base OTB installation.

I am very grateful to the OTB core development team (Julien, Manuel, Guillaume, the other Julien, Mickaël, and maybe others that I forget) for their efforts in the work done for the modularisation and the development of the SuperBuild procedure. This is the kind of thing which is not easy to see from the outside, but makes OTB go forward steadily and makes it a very mature and powerful piece of software.

13 May 2015

A simple thread pool to run parallel PROSail simulations

In otb-bv, we use the OTB versions of the Prospect and Sail models to perform satellite reflectance simulations of vegetation.

The code for the simulation of a single sample uses the ProSail simulator configured with the satellite relative spectral responses, the acquisition parameters (angles) and the biophysical variables (leaf pigments, LAI, etc.):

ProSailType prosail;
prosail.SetRSR(satRSR);
prosail.SetParameters(prosailPars);
prosail.SetBVs(bio_vars);
auto result = prosail();

A simulation is computationally expensive and it would be difficult to parallelize the code. However, if many simulations are going to be run, it is worth using all the available CPU cores in the machine.

I show below how the C++11 standard support for threads makes it easy to run many simulations in parallel.

Each simulation uses a set of variables given in an input file. We parse the sample file in order to get the input parameters for each sample and we construct a vector of simulations with the appropriate size to store the results.

otbAppLogINFO("Processing simulations ..." << std::endl);
auto bv_vec = parse_bv_sample_file(m_SampleFile);
auto sampleCount = bv_vec.size();
otbAppLogINFO("" << sampleCount << " samples read."<< std::endl);
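// parentheses: sampleCount default-constructed results (braces could select an initializer-list constructor)
std::vector<SimulationType> simus(sampleCount);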

The simulation function is actually a lambda which will sequentially process a sequence of samples and store the results into the simus vector. We capture by reference the parameters which are the same for all simulations (the satellite relative spectral responses satRSR and the acquisition angles in prosailPars):

auto simulator = [&](std::vector<BVType>::const_iterator sample_first,
                     std::vector<BVType>::const_iterator sample_last,
                     std::vector<SimulationType>::iterator simu_first){
  ProSailType prosail;
  prosail.SetRSR(satRSR);
  prosail.SetParameters(prosailPars);
  while(sample_first != sample_last)
    {
    prosail.SetBVs(*sample_first);
    *simu_first = prosail();
    ++sample_first;
    ++simu_first;
    }
};

We start by figuring out how to split the simulation into concurrent threads. How many cores are there?

auto num_threads = std::thread::hardware_concurrency();
otbAppLogINFO("" << num_threads << " CPUs available."<< std::endl);

So we define the size of the chunks we are going to run in parallel and we prepare a vector to store the threads:

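// hardware_concurrency() may return 0 when the number of cores is unknown
if(num_threads == 0) num_threads = 1;
// no point in starting more threads than there are samples
if(num_threads > sampleCount) num_threads = static_cast<unsigned>(sampleCount);
auto block_size = (num_threads > 0) ? sampleCount/num_threads : sampleCount;
std::vector<std::thread> threads(num_threads);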

Here, I choose to use as many threads as cores available, but this could be changed by a multiplicative factor if we know, for instance that disk I/O will introduce some idle time for each thread.
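
As a rough sketch of that idea, the helper below computes how many samples each thread would receive when the thread count is the number of cores times a factor. The function name and the `factor` parameter are hypothetical and not part of otb-bv. The helper also guards against a core count of 0 and spreads the remainder of the integer division over the first blocks, so that no sample is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes of the blocks handed to each thread. `cores` would typically come from
// std::thread::hardware_concurrency(), which may return 0 when unknown.
// `factor` > 1 oversubscribes the cores, e.g. to hide disk I/O waits.
std::vector<std::size_t> block_sizes(std::size_t sample_count,
                                     std::size_t cores,
                                     std::size_t factor) {
  assert(factor >= 1);
  if (cores == 0) cores = 1;
  std::size_t num_threads = cores * factor;
  // Never create more threads than there are samples.
  if (num_threads > sample_count) num_threads = sample_count;
  std::vector<std::size_t> sizes;
  if (num_threads == 0) return sizes;  // nothing to process
  const std::size_t base = sample_count / num_threads;
  const std::size_t extra = sample_count % num_threads;
  // The first `extra` blocks get one more sample, so that none is left out.
  for (std::size_t t = 0; t < num_threads; ++t) {
    sizes.push_back(base + (t < extra ? 1 : 0));
  }
  return sizes;
}
```

For 10 samples on 4 cores with a factor of 1, this yields blocks of 3, 3, 2 and 2 samples; with 2 cores and a factor of 2 the split is the same, but two threads share each core.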

And now we can fill the vector with the threads that will process every block of simulations:

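auto input_start = std::begin(bv_vec);
auto output_start = std::begin(simus);
for(size_t t=0; t<num_threads; ++t)
  {
  // samples not yet assigned to a thread
  const size_t remaining = std::distance(input_start, std::end(bv_vec));
  // the last thread takes everything that is left (including the remainder
  // of the integer division); no block may run past the end of the input
  const size_t count = (t+1 == num_threads) ? remaining
                                            : std::min<size_t>(block_size, remaining);
  auto input_end = input_start;
  std::advance(input_end, count);
  threads[t] = std::thread(simulator,
                           input_start,
                           input_end,
                           output_start);
  input_start = input_end;
  std::advance(output_start, count);
  }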

The std::thread takes the name of the function object to be called, followed by the arguments of this function, which in our case are the iterators to the beginning and the end of the block of samples to be processed and the iterator of the output vector where the results will be stored. We use std::advance to update the iterator positions.

After that, we have a vector of threads which have been started concurrently. In order to make sure that they have finished before trying to write the results to disk, we call join on each thread, which results in waiting for each thread to end:

std::for_each(threads.begin(),threads.end(),
              std::mem_fn(&std::thread::join));
otbAppLogINFO("" << sampleCount << " samples processed."<< std::endl);
for(const auto& s : simus)
  this->WriteSimulation(s);

This may not be the most efficient solution, nor the most general one. Using std::async and std::future would have avoided having to deal with block sizes, but with this solution we can easily decide the number of parallel threads that we want to use, which may be useful on a server shared with other users.
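
For comparison, a minimal sketch of the std::async variant could look as follows. `Sample`, `Result` and `simulate` are stand-ins for the otb-bv types and the ProSail call (here the "simulation" just doubles its input), so this only illustrates the pattern: one task per sample and no block sizes to compute.

```cpp
#include <cassert>
#include <future>
#include <vector>

using Sample = double;  // stand-in for BVType
using Result = double;  // stand-in for SimulationType

// Stand-in for a ProSail run: a pure function of one sample.
Result simulate(Sample s) { return 2.0 * s; }

// Launch one asynchronous task per sample and collect the results in order.
std::vector<Result> run_all(const std::vector<Sample>& samples) {
  std::vector<std::future<Result>> futures;
  futures.reserve(samples.size());
  for (const Sample& s : samples) {
    futures.push_back(std::async(std::launch::async, simulate, s));
  }
  std::vector<Result> results;
  results.reserve(samples.size());
  for (auto& f : futures) {
    results.push_back(f.get());  // blocks until this task has finished
  }
  assert(results.size() == samples.size());
  return results;
}
```

With `std::launch::async`, each task may get its own thread, so thousands of samples could mean thousands of threads: this is the loss of control over the number of threads mentioned above.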

18 Mar 2015

Data science in C?

Coursera is offering a Data Science specialization taught by professors from Johns Hopkins. I discovered it via one of their courses which is about reproducible research. I have been watching some of the video lectures and they are very interesting, since they combine data processing, programming and statistics.

The courses use the R language, which is free software and is one of the reference tools in the field of statistics, but it is also very much used for other kinds of data analysis.

While I was watching some of the lectures, I had some ideas to be implemented in some of my tools. Although I have used GSL and VXL¹ for linear algebra and optimization, I have never really needed statistics libraries in C or C++, so I ducked a bit and found the apophenia library², which builds on top of GSL, SQLite and Glib to provide very useful tools and data structures to do statistics in C.

Browsing a little bit more, I found that the author of apophenia has written a book "Modeling with data"³, which teaches you statistics like many books about R, but using C, SQLite and gnuplot.

This is the kind of technical book that I like most: good math and code, no fluff, just stuff!

The author sets the stage from the foreword. An example from page xii (the emphasis is mine):

" The politics of software

All of the software in this book is free software, meaning that it may be freely downloaded and distributed. This is because the book focuses on portability and replicability, and if you need to purchase a license every time you switch computers, then the code is not portable. If you redistribute a functioning program that you wrote based on the GSL or Apophenia, then you need to redistribute both the compiled final program and the source code you used to write the program. If you are publishing an academic work, you should be doing this anyway. If you are in a situation where you will distribute only the output of an analysis, there are no obligations at all. This book is also reliant on POSIX-compliant systems, because such systems were built from the ground up for writing and running replicable and portable projects. This does not exclude any current operating system (OS): current members of the Microsoft Windows family of OSes claim POSIX compliance, as do all OSes ending in X (Mac OS X, Linux, UNIX,…)."

Of course, the author knows the usual complaints about programming in C (or C++ for that matter) and spends many pages explaining his choice:

"I spent much of my life ignoring the fundamentals of computing and just hacking together projects using the package or language of the month: C++, Mathematica, Octave, Perl, Python, Java, Scheme, S-PLUS, Stata, R, and probably a few others that I’ve forgotten. Albee (1960, p 30)⁴ explains that “sometimes it’s necessary to go a long distance out of the way in order to come back a short distance correctly;” this is the distance I’ve gone to arrive at writing a book on data-oriented computing using a general and basic computing language. For the purpose of modeling with data, I have found C to be an easier and more pleasant language than the purpose-built alternatives—especially after I worked out that I could ignore much of the advice from books written in the 1980s and apply the techniques I learned from the scripting languages."

The author explains that C is a very simple language:

" Simplicity

C is a super-simple language. Its syntax has no special tricks for polymorphic operators, abstract classes, virtual inheritance, lexical scoping, lambda expressions, or other such arcana, meaning that you have less to learn. Those features are certainly helpful in their place, but without them C has already proven to be sufficient for writing some impressive programs, like the Mac and Linux operating systems and most of the stats packages listed above."

And he makes it really simple, since he actually teaches you C in one chapter of 50 pages (and 50 pages counting source code is not that much!). He does not teach you all C, though:

"As for the syntax of C, this chapter will cover only a subset. C has 32 keywords and this book will only use 18 of them."

At one point in the introduction I worried about the author bashing C++, which I like very much, but he actually gives a good explanation of the differences between C and C++ (emphasis and footnotes are mine):

"This is the appropriate time to answer a common intro-to-C question: What is the difference between C and C++? There is much confusion due to the almost-compatible syntax and similar name—when explaining the name C-double-plus⁵, the language’s author references the Newspeak language used in George Orwell’s 1984 (Orwell, 1949⁶; Stroustrup, 1986, p 4⁷). The key difference is that C++ adds a second scope paradigm on top of C’s file- and function-based scope: object-oriented scope. In this system, functions are bound to objects, where an object is effectively a struct holding several variables and functions. Variables that are private to the object are in scope only for functions bound to the object, while those that are public are in scope whenever the object itself is in scope. In C, think of one file as an object: all variables declared inside the file are private, and all those declared in a header file are public. Only those functions that have a declaration in the header file can be called outside of the file. But the real difference between C and C++ is in philosophy: C++ is intended to allow for the mixing of various styles of programming, of which object-oriented coding is one. C++ therefore includes a number of other features, such as yet another type of scope called namespaces, templates and other tools for representing more abstract structures, and a large standard library of templates. Thus, C represents a philosophy of keeping the language as simple and unchanging as possible, even if it means passing up on useful additions; C++ represents an all-inclusive philosophy, choosing additional features and conveniences over parsimony."

It is actually funny that I find myself using less and less class inheritance and leaning towards small functions (often templates) and when I use classes, it is usually to create functors. This is certainly due to the influence of the Algorithmic journeys of Alex Stepanov.

Footnotes:

¹ Actually the numerics module, vnl.

² Apophenia is the experience of perceiving patterns or connections in random or meaningless data.

³ Ben Klemens, Modeling with Data: Tools and Techniques for Scientific Computing, Princeton University Press, 2009, ISBN: 9780691133140.

⁴ Albee, Edward. 1960. The American Dream and Zoo Story. Signet.

⁵ See http://en.wikipedia.org/wiki/List_of_Newspeak_words#Prefixes

⁶ Orwell, George. 1949. 1984. Secker and Warburg.

⁷ Stroustrup, Bjarne. 1986. The C++ Programming Language. Addison-Wesley.

26 Nov 2014

Is open and free global land cover mapping possible?

Short answer: yes.

In mid-November, "Le Capitole du libre", a conference on Free Software and Free Culture, took place in Toulouse. The program this year was again full of interesting talks and workshops.

This year, I attended a workshop about contributing to Openstreetmap (OSM) using the JOSM software. The workshop was organised by Sébastien Dinot who is a massive contributor to OSM, and more importantly a very nice and passionate fellow.

I was very happy to learn to use JOSM and made 2 minor contributions right there.

During the workshop I learned that, in the past, OSM has been enriched using massive imports from open data sources, like for instance cadastral data bases from different countries or the Corine Land Cover data base. This has been possible thanks to the policies of many countries which have understood that the commons are important for the advancement of society. One example of this is the European INSPIRE initiative.

I was also interested to learn that what could be considered niche data, like agricultural land parcel data bases such as the French RPG, has also been imported into OSM. Since I have been using the RPG at work for the last 4 years (see for example here or here), I could sympathize with the difficulties OSM contributors have in exploiting these data efficiently. I understood that the Corine Land Cover import was also difficult and the results were not fully satisfactory.

As a matter of fact, roads, buildings and other cartographic objects are easier to map than land cover, since they are discrete and sparse. They can be pointed, defined and characterised more easily than natural and semi-natural areas.

After that, I could not avoid making the link with what we do at work in terms of preparing the exploitation of upcoming satellite missions for automatic land cover map production.

One of our main interests is the use of Sentinel-2 images. It is the weekend while I am writing this, so I will not use my free time to explain how land cover map production from multi-temporal satellite images works: I already did it in my day job.

What is therefore the link between what we do at work and OSM? The revolutionary thing from my point of view is the fact that Sentinel-2 data will be open and free, which means that the OSM project could use it to have a constantly up to date land cover layer.

Of course, Sentinel-2 data will come in huge volumes and a good amount of expertise will be needed to use them. However, several public agencies are paving the road in order to deliver data which is easy to use. For instance, the THEIA Land Data Centre will provide Sentinel-2 data which is ready to use for mapping. The data will be available with all the geometric and radiometric corrections of the best quality.

Actually, right now this is being done, for instance, for Landsat imagery. Of course, all these data are and will be available under open and free licences, which means that anyone can start right now learning how to use them.

However, going from images to land cover maps is not straightforward. Again, a good deal of expertise and efficient tools are needed in order to convert pixels into maps. This is what I have the chance to do at work: building tools to convert pixels into maps which are useful for real world applications.

Applying the same philosophy to tools as for data, the tools we produce are free and open. The core of all these tools is of course the Orfeo Toolbox, the Free Remote Sensing Image Processing Library from CNES. We have several times demonstrated that the tools are ready to efficiently exploit satellite imagery to produce maps. For instance, in this post here you even have the sequence of commands to generate land cover maps using satellite image time series.

This means that we have free data and free tools. Therefore, the complete pipeline is available for projects like OSM. OSM contributors could start right now getting familiar with these data and these tools.

Head over to CNES servers to get some Landsat data, install OTB, get familiar with the classification framework and see what could be done for OSM.

It is likely that some pieces may still be missing. For instance, the main approach for the map production is supervised classification. This means that we use machine learning algorithms to infer which land cover class is present at every given site using the images as input data. For these machine learning algorithms to work, we need training data, that is, we need to know before hand the correct land cover class in some places so the algorithm can be calibrated.
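To make the idea of calibration concrete, here is a minimal sketch of supervised classification in Python. It uses a nearest-centroid rule, which is only one of the simplest possible algorithms and is not what OTB's classification framework does; the two-band reflectance values and the class names are made up for illustration.

```python
from math import dist
from statistics import fmean


def train(samples):
    """Compute one mean feature vector (centroid) per land cover class.

    samples is the training data: a list of (features, label) pairs.
    """
    by_class = {}
    for features, label in samples:
        by_class.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(band) for band in zip(*vectors))
        for label, vectors in by_class.items()
    }


def classify(centroids, features):
    """Return the class whose centroid is closest to the pixel's features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Made-up (red, near-infrared) reflectances, for illustration only.
training = [
    ((0.05, 0.45), "vegetation"),
    ((0.07, 0.50), "vegetation"),
    ((0.04, 0.02), "water"),
    ((0.06, 0.03), "water"),
]
centroids = train(training)
print(classify(centroids, (0.06, 0.40)))
```

The training step is the calibration: without labelled sites there are no centroids, and nothing to compare a new pixel against.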

This training data is usually called ground truth and it is expensive and difficult to get. In a global mapping context, this can be a major drawback. However, there are interesting initiatives which could be leveraged to help here. For instance, Geo-Wiki comes to mind as a possible source of training data.

As always, talk is cheap, but it seems to me that exciting opportunities are available for open and free quality global mapping. This does not mean that the task is easy. It is not. There are many issues to be solved yet and some of them are at the research stage. But this should not stop motivated mappers and hackers from starting to learn to use the data and the tools.

27 Oct 2014

xdg-open or "computer, do what I mean"

Command line nerds¹ say that graphical user interfaces are less efficient than the command line, since the command line is driven from the keyboard while graphical interfaces tend to need the mouse.

Graphical user interface fans say that command line based work-flows need the user to know more things about the system. Although I don't see why this would be an issue (what's the problem with getting to know better the system you use every day?), I can see one clear advantage to the point-and-click work-flow.

When the user wants to open a file, double-clicking on it just works. On the other hand, on the command line, the user needs to call the appropriate application and pass the file name as argument:

evince my_document.pdf

That means that the user has to know which is the application that is going to correctly deal with the given file. For example, this won't work:

evince my_movie.mp4

because evince is a document viewer and it only understands formats such as PDF, PostScript, DjVu or DVI, but it is not able to understand MPEG-4 videos.

So how can one replicate the point-and-click behavior on the command line? It would be nice to have something like:

open my_file

and that auto-magically the appropriate application was chosen.

It is of course possible, and it can be done in the same way as the graphical desktop environment does it: using xdg-utils.

Inside a desktop environment (e.g. GNOME, KDE, or Xfce), xdg-open simply passes the arguments to that desktop environment's file-opener application (gvfs-open, kde-open, or exo-open, respectively), which means that the associations are left up to the desktop environment. Therefore, on the command line, one can do:

xdg-open my_file.pdf

and the appropriate application will be used (evince on GNOME, okular on KDE, etc.).

When no desktop environment is detected (for example when one runs a standalone window manager, e.g. stumpwm), xdg-open will use its own configuration files.
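The delegation logic can be pictured with a small Python sketch. It assumes that the desktop environment is announced only through the XDG_CURRENT_DESKTOP variable, which is a simplification: the real xdg-open script is a shell script and looks at more clues than this. The function name pick_opener is mine, and the opener names are the ones quoted above.

```python
import os

# Openers named in the text, keyed by desktop environment.
OPENERS = {"GNOME": "gvfs-open", "KDE": "kde-open", "XFCE": "exo-open"}


def pick_opener(env=None):
    """Return the helper to delegate to, or None when no known desktop
    environment is detected (xdg-open then uses its own configuration)."""
    env = os.environ if env is None else env
    # XDG_CURRENT_DESKTOP may hold several colon-separated names.
    for name in env.get("XDG_CURRENT_DESKTOP", "").split(":"):
        opener = OPENERS.get(name.upper())
        if opener:
            return opener
    return None


print(pick_opener({"XDG_CURRENT_DESKTOP": "KDE"}))
```

Under a standalone window manager the variable is usually unset, so the function returns None, which corresponds to the fallback case described above.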

Sometimes, the default associations between applications and file types may not suit the user. For instance, in my case, MPEG-4 videos were opened with mplayer, but I prefer vlc. The xdg-mime tool is meant to help you change the default associations:

xdg-mime default vlc.desktop video/mp4

The vlc.desktop parameter is an xdg desktop file which describes the vlc application. On my Debian GNU/Linux system, these files are located in /usr/share/applications. So I was able to look in there to see what applications xdg-open could use.

The parameter video/mp4 passed to xdg-mime is the type of file we want to associate the application with. In order to know what type of file we are dealing with, xdg-mime can query it for you:

xdg-mime query filetype my_movie.mp4

And the answer is:

video/mp4
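The two pieces, finding the type of a file and looking up the application associated with it, can be mimicked in a few lines of Python. This is only a sketch of the idea: Python's mimetypes module guesses from the file extension alone, the function names are mine, and the association table is a plain dictionary standing in for the configuration that xdg-mime really edits.

```python
import mimetypes

# The association set with xdg-mime in this post.
DEFAULTS = {"video/mp4": "vlc.desktop"}


def query_filetype(path):
    """Guess the MIME type of a file from its extension only."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime


def default_application(path, defaults=DEFAULTS):
    """Return the desktop file associated with the file's type, if any."""
    return defaults.get(query_filetype(path))


print(query_filetype("my_movie.mp4"))
print(default_application("my_movie.mp4"))
```

A file whose type has no entry in the table gets None, which is the situation where one would want to add an association.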

Now, what happens if there is no desktop file for an application I want to use? For instance, remote sensing image visualization is not possible with the standard image viewers available on GNU/Linux. I personally use the OTB IceViewer, which is a lightweight application using the rendering engine developed for Monteverdi2.

In this case, I have to create a desktop file for the application. In my case, I have created the otbice.desktop file in /home/inglada/.local/share/applications/ with the following contents:

[Desktop Entry]
Type=Application
Name=OTB Ice Viewer
Exec=/home/inglada/OTB/builds/Ice/bin/otbiceviewer %U
Icon=otb-logo
Terminal=false
Categories=Graphics;2DGraphics;RasterGraphics;
MimeType=image/bmp;image/gif;image/x-portable-bitmap;image/x-portable-graymap;image/x-portable-pixmap;image/x-xbitmap;image/tiff;image/jpeg;image/png;image/x-xpixmap;image/jp2;image/jpeg2000;

The file has to be executable, so I have to do:

chmod +x /home/inglada/.local/share/applications/otbice.desktop

After that, I can associate the IceViewer with the image formats I want:

xdg-mime default otbice.desktop image/tiff

And it just works.
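Since a desktop file is a simple key=value text file, it is easy to check which formats an entry declares before associating them. Here is a sketch using Python's configparser, with a shortened copy of the file above; I am assuming that configparser is tolerant enough for simple files like this one, and it is not a complete parser for the desktop entry format. The helper names read_entry and mime_types are mine.

```python
from configparser import ConfigParser

# Shortened copy of the desktop file shown above.
DESKTOP_FILE = """\
[Desktop Entry]
Type=Application
Name=OTB Ice Viewer
Exec=/home/inglada/OTB/builds/Ice/bin/otbiceviewer %U
MimeType=image/tiff;image/jpeg;image/png;
"""


def read_entry(text):
    """Parse the [Desktop Entry] group of a desktop file into a dict."""
    # interpolation=None keeps field codes such as %U literal.
    parser = ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of the keys
    parser.read_string(text)
    return dict(parser["Desktop Entry"])


def mime_types(entry):
    """Split the semicolon-separated MimeType value into a list."""
    return [m for m in entry.get("MimeType", "").split(";") if m]


entry = read_entry(DESKTOP_FILE)
print(entry["Name"], mime_types(entry))
```

Each type in the resulting list is a candidate for an xdg-mime default command like the one used for image/tiff.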

Footnotes:

And I am one. See here, here and here for example.

04 Jul 2014

Two misconceptions about open source

There are a couple of ideas which circulate about open source¹ projects which I think are misconceptions about how software is developed.

Throw it over the wall

The first one is about criticisms of projects where the developers publish the source code only for released versions instead of maintaining a public version control system. That is, tarballs are made available when the software is considered to be ready, instead of pushing all the changes in real time to services like Github or Bitbucket.

This started to annoy me when a couple of years ago, some bloggers and podcasters started to say that Android is not really open source, since the development is not done in the open. I am not at all a Google fanboy, quite the opposite actually, but I understand that both Google and the individual lone hacker in a basement may want to fully control how their code evolves between releases without external interference.

I personally don't do things like this, but if the source code is available, it is open source. It may not be a community project, and some may find it ugly to throw code over the wall like this, but this is another thing.

Eyeballs and shallow bugs

The recent buzz about the Heartbleed bug has put back in the spotlight what Eric S. Raymond called Linus' Law:

Given enough eyeballs, all bugs are shallow.

Raymond develops it as:

Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.

Right after Heartbleed, many voices in the inter-webs took it as a counterexample of Linus' Law. Their reasoning was that since the bug was introduced more than 2 years before, the fact that the source code was open didn't help to catch it quickly.

I think that the argument is flawed if we take into account the 3 steps involved in debugging software:

1. Identify that a bug exists: some user (who may be one of the developers) detects a misbehavior in the software. This step may involve characterizing the bug in terms of being able to reproduce the wrong behavior.
2. Find the bug: track down the issue until a limited piece of code is identified as containing the source of the problem found.
3. Fix the bug: modify the source code in order to obtain the correct or desired behavior.

Usually, the second and third steps are the hardest, but they can't be taken until the bug is detected. After that, Linus' Law comes into play, and the more people looking at the source code, the more likely it is for the bug to be found and fixed.

I don't like criticizing Wikipedia, but this time I think that the misunderstanding may come from explanations such as the one in the Linus' Law Wikipedia page about Raymond's text.

Footnotes:

I will use the term open source instead of free software because the important point here is that the source code is available, not the particular license under which it is distributed.

03 Jan 2011

Reproducible research

I have recently implemented the Spectral Rule Based Landsat TM image classifier described in ¹.

This paper proposes a set of radiometric combinations, thresholds and logic rules to distinguish more than 40 spectral categories on Landsat images. My implementation is available in the development version of the Orfeo Toolbox and should be included in the next release:

One interesting aspect of the paper is that all the information needed for the implementation of the method is given: every single value for thresholds, indexes, etc. is written down in the paper. This was really useful for me, since I was able to code the whole system without getting stuck on unclear things or hidden parameters. This is so rarely found in image processing literature that I thought it was worth posting about it.

But this is not all. Once my implementation was done, I was very happy to get some Landsat classifications, but I was not able to decide whether the results were correct or not. Since the author of the paper seemed to want his system to be used and gave all details for the implementation, I thought I would ask him for help for the validation. So I sent an e-mail to A. Baraldi (whom I had already met before) and asked for some validation data (input and output images generated by his own implementation).

I got something better than only images. He was kind enough to send me the source code of the very same version of the software which was used for the paper (the system continues to be enhanced and the current version seems to be far better than the one published). So now I have all that is needed for reproducible research:

A clear description of the procedure with all the details needed for the implementation.
Data in order to run the experiments.
The source code so that errors can be found and corrected.

I want to publicly thank A. Baraldi for his kindness and I hope that this way of doing science will continue to grow. If you want to know more about reproducible research, check this site.

Footnotes:

Baraldi et al. 2006, "Automatic Spectral Rule-Based Preliminary Mapping of Calibrated Landsat TM and ETM+ Images", IEEE Trans. on Geoscience and Remote Sensing, vol 44, no 9.

01 Jul 2010

Community building

In my last post I wrote about open approaches for Earth Observation. Open (science/source/whatever) means community. I am in the middle of reading "The Art of Community" and it is very inspiring. In some ways, it reminds me very much of Chapter 4 "Social and Political Infrastructure" of Karl Fogel's "Producing Open Source Software - How to Run a Successful Free Software Project", although Bacon's book is more general.

Anyway, both books give very good insight on the issues and tricks involved in a community-based project and even the particular case of projects which are created and managed by companies.

This is an interesting point, since one could think that a project funded by a company has more chances to succeed than a pure volunteer-based one, but many real examples show that this is not the case. The main pitfalls of a corporate open source project are related to decision making and communication. For a project to succeed, the community has to be respected, and open discussions and meritocracy are the 2 pillars of community.

Examples of this are the fact that all decisions involving the development have to be explained and discussed in a way that all developers are involved. Examples of closed discussions at the company level leaving out the volunteer contributors are given.

Another interesting example is the one given by Fogel about meritocracy: any developer should earn the write access to the source repository by proving his value. Usually, in corporate environments, developers have commit access from the first day, while external contributors, who usually are much more capable, have to wait a long time before that.

An important player in any open source project is the Benevolent Dictator (BD). I will cite Fogel verbatim here:

"It is common for the benevolent dictator to be a founder of the project, but this is more a correlation than a cause. The sorts of qualities that make one able to successfully start a project—technical competence, ability to persuade other people to join, etc.—are exactly the qualities any BD would need. And of course, founders start out with a sort of automatic seniority, which can often be enough to make benevolent dictatorship appear the path of least resistance for all concerned.

Remember that the potential to fork goes both ways. A BD can fork a project just as easily as anyone else, and some have occasionally done so, when they felt that the direction they wanted to take the project was different from where the majority of other developers wanted to go. Because of forkability, it does not matter whether the benevolent dictator has root (system administrator privileges) on the project's main servers or not. People sometimes talk of server control as though it were the ultimate source of power in a project, but in fact it is irrelevant. The ability to add or remove people's commit passwords on one particular server affects only the copy of the project that resides on that server. Prolonged abuse of that power, whether by the BD or someone else, would simply lead to development moving to a different server.

Whether your project should have a benevolent dictator, or would run better with some less centralized system, largely depends on who is available to fill the role. As a general rule, if it's simply obvious to everyone who should be the BD, then that's the way to go. But if no candidate for BD is immediately obvious, then the project should probably use a decentralized decision-making process …"

And I will end this post by using the opening quote of Bacon's chapter 9:

"The people to fear are not those who disagree with you, but those who disagree with you and are too cowardly to let you know." —Napoléon Bonaparte

Food for thought.

19 Jun 2010

Open Remote Sensing

Right after the recent Haiti earthquake a community of volunteers put their efforts together in order to build up to date maps of the area in order to help the humanitarian aid on the terrain.

This community is the one who is contributing to Open Street Map (OSM), the wiki-like world wide map on the Internet. A detailed presentation about this activity can be watched here.

This kind of activity can only be carried out if the volunteers have the input data (images, gps tracks) and the appropriate tools (software).

In terms of software, the OSM people rely on open source state of the art tools which allow them to be quick and efficient. As far as I know, their infrastructure is based on OSGEO tools for the storage, the formats, the web services for edition and visualization, and so on.

So open people with open tools generating open maps is something which has been proven useful and efficient for a while now: just take a look at the status of OSM near where you live in order to convince yourself. This is another example of the Wisdom of Crowds phenomenon.

However, one thing which was really special in the Haiti case is the speed at which the maps were created after the earthquake. This may be surprising when one thinks that most of the work for OSM is based on gps tracks submitted by people who are on the terrain or by volunteers digitizing somewhat outdated aerial imagery.

In the Haiti case, what really allowed for a quick cartographic response was that space agencies and satellite image providers made freely available the images acquired not long before and after the event. This allowed for really accurate maps of the communication infrastructures and buildings as well as a damage assessment of those. Refugee camps could also be easily located.

Unfortunately, the availability of this kind of data is not usual. Even with initiatives such as the Disasters Charter, the images made available by the signing parties and the added value maps are distributed under licenses which are relatively restrictive.

And as one can easily understand, the open people using open tools to create open maps are completely useless if they don't have the images to work with. Many people will agree on the fact that this should change when lives are at stake. Other people will think that, even for general purpose cartography, data should be freely available. Of course, space agencies, the aerospace industries, etc. have hard (economic and industrial) constraints which don't allow them to give away expensive images.

And of course, a bunch of volunteers don't have the resources to put satellites in orbit and acquire images. One thing is writing free software and another thing is rocket science! Or is it?

It seems that a group of people are thinking about doing things not far from that.

The Open Aerial Map project aims at building a world wide cover with free aerial images. Even if the project has been dead for a while, it seems to be active again.

Another interesting initiative is the Paparazzi project, which aims at building open source software and hardware for unmanned aerial vehicles. One of the projects using it is made by the University of Stuttgart and aims at developing a system for high resolution aerial imagery.

The list of interesting free projects which could be used to setup an open source global nearly real time mapping system is long. It is likely that all the bricks will be put together someday by volunteers.

If this comes to reality, I suggest just one restriction on the license: the user wanting to print the map will have to read Borges' "Universal History of Infamy" before choosing the scale.

27 Apr 2010

Open tools for modeling and simulation

It has now been nearly 2 months since I moved to a new job. My research subject has slightly evolved, from sub-meter resolution image information extraction (mainly object recognition for scene interpretation) towards high temporal resolution for environment applications. The change in the resolution dimension (from spatial to temporal) is the main re-orientation of the work, but if I have to be honest, this would not be challenging enough to justify leaving what I was doing before, the nice colleagues I had (though the new ones are also very nice), etc.

The main challenge of the new job is to dive in the world of physics and modelling. Although I did a lot of physical modelling and simulation in my PhD, this was 10 years ago and it was SAR over the ocean, while now it is mainly optical over continental surfaces.

I have the chance to have landed in a lab with people who are specialists in these topics, so I can ask lots of questions about very specific issues. The problem is that I am lacking basic knowledge about state of the art approaches and I don't like to bother my new colleagues (since they are nice!) with stupid questions.

This is where openly available resources are very helpful. I will just cite 2 pointers among the lots of relevant stuff I am using for my learning.

In terms of modelling the physical and biological processes in an agricultural field, I found Daisy, which is an open source simulation system developed by Søren Hansen's team at the University of Copenhagen. In addition to the source code, there is a very rich theoretical documentation available.

Once these processes are understood and simulated, I was also interested in learning how things can be measured by optical sensors. I found Stephane Jacquemoud's PROSAIL approach for which source code and reference documentation (papers, PhD dissertations) are available online.

From there, I just put things together in order to learn with a hands on approach. PROSAIL is Fortran and Daisy is C++. I wanted to plot some simulated vegetation spectra. So I fired up Emacs and started writing some python code in order to loop over variable ranges, launch the simulators, plot things with Gnuplot.py, and so on. And then, I remembered that we have python wrappers for OTB, which would make possible the use of the 6S radiative transfer code using OTB's internal version of it.
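The looping part is the kind of glue code that is quick to write in Python. Here is a sketch of it, not the script I actually wrote: the parameter names are hypothetical, and the simulator is a stand-in function which only echoes its inputs where real code would launch PROSAIL or Daisy and collect their outputs.

```python
from itertools import product


def sweep(simulate, ranges):
    """Call simulate(**params) for every combination of the given ranges.

    ranges maps a parameter name to the list of values to try.
    Returns a list of (params, result) pairs.
    """
    names = list(ranges)
    runs = []
    for values in product(*(ranges[name] for name in names)):
        params = dict(zip(names, values))
        runs.append((params, simulate(**params)))
    return runs


# Stand-in for a real simulator call; it only echoes its inputs.
def fake_simulator(lai, chlorophyll):
    return f"lai={lai} chlorophyll={chlorophyll}"


runs = sweep(fake_simulator, {"lai": [1, 3, 5], "chlorophyll": [20, 40]})
for params, result in runs:
    print(params, result)
```

Keeping the simulator behind a plain function call means the same loop can drive a Fortran executable, a C++ one, or a wrapped library without changes.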

And here we are, with a fully open source system which allows us to do physics-based remote sensing. Isn't that cool?