Jordi Inglada

Posts tagged "programming":

17 Apr 2023

Use ChatGPT and feed the technofeudal beast

tl;dr1: Big Tech is deploying AI assistants on their cloud work-spaces to help you work better, but are you sure that what you produce using these tools falls under your copyright and not that of your AI provider?


Unless you have been living under a rock for the last few months, you have heard about ChatGPT and you probably have played with it. You may even make part of those who have been using it for somewhat serious matters, like writing code, doing your homework, or writing reports for your boss.

There are many controversies around the use of these Large Language Models (LLM, ChatGPT is not the only one). Some people are worried about white collar workers being replaced by LLM as blue collar ones were by mechanical robots. Other people are concerned with the fact that these LLM have been trained on data under copyright. And yet another set of people may be worried about the use of LLM for massive fake information spreading.

All these concerns are legitimate2, but there is another one that I don't see addressed and that, from my point of view is, at least, as important as those above: the fact that these LLM are the ultimate stage of proprietary software and user lock-in. The final implementation of technofeudalism.

Yes, I know that ChatGPT has been developed by OpenAI (yes open) and they just want to do good for humankind:

OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

But remember, ChatGPT is not open source and it is not a common good or a public service, as opposed to Wikipedia, for instance.

The evolution of user lock-in in software

Proprietary software is software that comes with restrictions on what the user can do with it.

A long time ago, software used to come in a box, on floppy disks, cartridges or CD-ROM. The user put the disk on the computer and installed it. There was a time, were, even if the license didn't allow it, a user could lend the disk to friends so they could install it on their own computers. At that time, the only lock-in that the vendor could enforce was that of proprietary formats: the text editor you bought would store your writing in a format that only that software could write.

In order to avoid users sharing a single paid copy of the software, editors implemented artificial scarcity: a registration procedure. For the software to work, you needed to enter a code to unlock it. With the advent of the internet, the registration could be done online, and therefore, just one registration per physical box was possible. This was a second lock-in. Of course this was not enough, and editors implemented artificial obsolescence: expiring registrations. The software could stop working after a predefined period of time and a new registration was needed.

Up to this point, clever users always found a way to unlock the software, because software can be retro-engineered. In Newspeak, this is called pirating.

The next step was to move software to the cloud (i.e. the editor's computers). Now, you didn't install the software on your computer, you just connected to the editor's service and worked on your web browser. This allowed the editor to terminate your license at any time and even to implement pay-per-use plans. Added to that, your data was hosted on their servers. This is great, because you always have access to the most recent version of the software and your data is never lost if your computer breaks. This is also great for the editor, because they can see what you do with their software and have a peek at your data.

This software on the cloud has also allowed the development of the so-called collaborative work: several people editing the same document, for instance. This is great for you, because you have the feeling of being productive (don't worry, it's just a feeling). This is great for the editor, because they see how you work. Remember, you are playing a video-game on their computer and they store every action you do3, every interaction with your collaborators.

Now, with LLM, you are in heaven, because you have suggestions on how to write things, you can ask questions to the assistant and you are much more productive. This is also great for the editor, because they now not only know what you write, how you edit and improve your writing, but they also know what and how you think. Still better, your interactions with the LLM are used to improve it. This is a win-win situation, right?

How do LLM work?

The short answer is that nobody really knows. This is not because there is some secret, but rather because they are complex and AI researchers themselves don't have the mathematical tools to explain why they work. We can however still try to get a grasp of how they are implemented. You will find many resources online. Here I just show off a bit. You can skip this section.

LLM are a kind of artificial neural network, that is, a machine learning system that is trained with data to accomplish a set of tasks. LLM are trained with textual data, lots of textual data: the whole web (Wikipedia, forums, news sites, social media), all the books ever written, etc. The task they are trained to perform is just to predict the next set of words given a text fragment. You give the beginning of a sentence to the system and it has to produce the rest of it. You give it a couple of sentences and it has to finish the paragraph. You see the point.

What does "training a system" mean? Well, this kind of machine learning algorithm is just a huge mathematical function which implements millions of simple operations of the kind \(output = a \times input + b\). All these operations are plugged together, so that the output of the first operation becomes the input of the second, etc. In the case of a LLM, the very first output is the beginning of the sentence that the system has to complete. The very last output of the system is the "prediction" of the end of the sentence. You may wonder how you add or multiply sentences? Well, words and sentences are just transformed to numbers with some kind of dictionary, so that the system can do its operations. So to recap:

  1. you take text and transform it to sequences of numbers;
  2. you take these numbers and perform millions of operations like \(a\times input + b\) and combine them together;
  3. you transform back the output numbers into text to get the answer of the LLM.

Training the model means finding the millions of \(a\) and \(b\) (the parameters of the model) that perform the operations. If you don't train the model, the output is just rubbish. If you have enough time and computers, you can find the good set of parameters. Of course, there is a little bit of math behind all this, but you get the idea.

Once the model is trained, you can give it a text that it has never encountered (a prompt) and it will produce a plausible completion (the prediction). The impressive thing with these algorithms is that they do not memorize the texts used for training. They don't check a database of texts to generate predictions. They just apply a set of arithmetic operations that work on any input text. The parameters of the model capture some kind of high level knowledge about the language structure.

Well, actually, we don't know if LLM do not memorize the training data. LLM are to big and complex to be analyzed to check that. And this is one of the issues related to copyright infringement that some critics have raised. I am not interested in copyright here, but this impossibility to verify that training data can't appear at the output of a LLM will be of crucial importance later on.

Systems like ChatGPT, are LLM that have been fine-tuned to be conversational. The tuning has been done by humans preparing questions and answers so that the model can be improved. The very final step is RLHF: Reinforcement Learning from Human Feedback. Once the model has been trained and fine-tuned, they can be used and their answers can be rated by operators in order to avoid that the model spits out racist comments, sexist texts or anti-capitalist ideas. OpenAI who, remember, just want to do good to humankind abused under-payed operators for this task.

OK. Now you are an expert on LLM and you are not afraid of them. Perfect. Let's move on.


If producing a LLM is not so complex, why aren't we all training our own models? The reason is that training a model like GPT-3 (the engine behind ChatGPT) needs about 355 years on a single NVIDIA Tesla V100 GPU4, so if you want to do this in something like several weeks you need a lot of machines and a lot of electricity. And who has the money to do that? The big tech companies.

You may say that OpenAI is a non profit, but Wikipedia reminds us that Microsoft provided OpenAI with a $1 billion investment in 2019 and a second multi-year investment in January 2023, reported to be $10 billion. This is how LLM are now available in Github Copilot (owned by Microsoft), Bing and Office 365.

Of course, Google is doing the same and Meta may also be doing their own thing. So we have nothing to fear, since we have the choice. This is a free market.

Or maybe not. Feudal Europe was not really a free market for serfs. When only a handful of private actors have the means to build this kind of tools we become dependent on their will and interests. There is also something morally wrong here: LLM like ChatGPT have been trained on digital commons like Wikipedia or open source software, on all the books ever written, on the creations of individuals. Now, these tech giants have put fences around all that. Of course, we still have Wikipedia and all the rest, but now we are being told that we have to use these tools to be productive. Universities are changing their courses to teach programming with ChatGPT. Orange is the new black.

Or as Cédric Durand explains it:

The relationship between digital multinational corporations and the population is akin to that of feudal lords and serfs, and their productive behavior is more akin to feudal predation than capitalist competition.


Platforms are becoming fiefs not only because they thrive on their digital territory populated by data, but also because they exercise a power lock on services that, precisely because they are derived from social power, are now deemed indispensable.5

When your thoughts are not yours anymore

I said above that I am not interested in copyright, but you might be. Say that you use a LLM or a similar technology to generate texts. You can ask ChatGPT to write a poem or the lyrics of a song. You just give it a subject, some context and there it goes. This is lots of fun. The question now is: who is the author of the song? It's clearly not you. You didn't create it. If ChatGPT had some empathy, it could give you some kind of co-authorship, and you should be happy with that. You could even build a kind of Lennon-McCartney joint venture. You know, many of the Beatles' songs were either written by John or Paul, but they shared the authorship. They were good chaps. Until they weren't anymore.

You may have used Dall-E or Stable Diffusion. These are AI models which generate images from a textual description. Many people are using them (they are accessible online) to produce logos, illustrations, etc. The produced images contain digital watermarking. This is a kind of code which is inserted in the image and that is invisible to the eye. This allows to track authorship. This can be useful to prevent deep fakes. But this can also be used to ask you for royalties if ever the image generates revenue.

Watermarking texts seems impossible. At the end of the day, a text is just a sequence of characters, so you can't hide a secret code in it, right? Well, wrong. You can generate a text with some statistical patterns. But, OK, the technology does not seem mature and seems easy to break with just small modifications. You may therefore think that there is no risk of Microsoft or Google suing you for using their copyrighted material. Well, the sad thing is that you produced your song using their software on their computers (remember, SaaS, the cloud), so they have the movie showing what you did.

This may not be a problem if you are just writing a song for a party or producing an illustration for a meme. But many companies will soon start having employees (programmers, engineers, lawyers, marketers, writers, journalists) producing content with these tools on these platforms. Are they willing to share their profits with Microsoft? Did I tell you about technofeudalism?

Working for the man

Was this response better or worse? This is what ChatGPT asks you if you tell it to regenerate a response. You just have to check one box: better, worse, same. You did it? Great! You are a RLHFer6. This is why ChatGPT is free. Of course, it is free for the good of humankind, but you can help it to get even better! Thanks for your help. By the way, Github Copilot will also get better. Bing also will get better. And Office 365 too.

Thanks for helping us feed the beast.




Abstract for those who have attention deficit as a consequence of consuming content on Twitter, Instragram and the like.


Well, at least if you believe in intellectual property, or think that bullshit jobs are useful or that politicians and companies tell the truth. One has to believe in something, I agree.


Yes, they do. This is needed to being able to revert changes in documents.


"My" translation. You have an interesting interview in English here


Reinforcement Learning from Human Feedback. I explained that above!

Tags: open-source free-software programming surveillance-capitalism
27 May 2022


Parcoursup est une machine à frustrations.

Les règles de classement des candidats par chaque formation sont opaques. Dans les formations sélectives, l'écart de notation entre les dossiers des admis et de ceux non admis peut être extrêmement faible. Du fait que les bulletins de notes ont un poids non négligeable, on peut se poser la question de l'équité entre élèves de classes et d'établissements différents.

À cela, il faut ajouter l'angoisse de devoir faire des choix dans un temps limité avant de perdre l'accès à des formations où le candidat est accepté.

Tout cela fait que Parcoursup est un Lotto géant où ceux qui sont mal conseillés risquent de perdre.

J'ai fait un petit programme qui permet de jouer à Parcoursup pour apprendre à gérer le stress auquel élèves et parents serons soumis.

L'interface utilisateur est rudimentaire, mais je ne sais pas faire aussi joli que les EdTech!


C'est une simulation qui utilise le hasard pour générer des scénarios. Ce que vous obtiendrez, même si vous rentrez correctement les données correspondant à vos vœux Parcoursup, ce sont des résultats aléatoires. Cependant, ce n'est pas n'importe quoi!

Le simulateur utilise votre rang dans votre classe pour vous attribuer un rang dans chaque formation. Le taux d'acceptation de chaque formation est aussi utilisé. La simulation de l'évolution des listes d'attente essaie d'être plausible, mais j'ai fait des choix arbitraires. Si cela vous intéresse, regardez le code. J'ai essayé de le documenter avec des explications détaillées.

Entre 2 parties du jeu, même si vous gardez les mêmes données d'entrée, les résultats seront différents. L'idée est que vous puissiez faire plein de simulations pour être confrontés à des choix qui ressemblent à ceux que vous pourriez être amenés à faire.


Vous aurez besoin de Python 3.x. Autrement, il suffit de récupérer le fichier

Vous devez adapter le programme à votre cas à votre cas particulier.

D'abord, vous devez modifier les valeurs suivantes :

# Votre rang dans votre classe

# Nombre d'élèves dans votre classe

Ensuite, vous devez rentrer vos vœux :

# Un voeux est un tuple ("nom", nb_places, rang_dernier_2021, taux_acceptation)
        "MIT - Aerospace",
        "Élite - Youtubeur ",
        "Oxford - Maths",
        "Stanford - Artificial Intelligence",
        "HEC pour fils à Papa",
    ("Fac de truc qui sert à rien", 1500, 145300, 0.9999),
    ("Éleveur de chèvres dans les Pyrénées", 180, 652, 0.1),

Les informations nécessaires (nombre de places, rang du dernier accepté l'année précédente et taux d'acceptation) sont disponibles dans votre espace Parcoursup.

Tuto indispensable

Vous pouvez lancer le programme en ligne de commande comme ceci :


Si vous utilisez un éditeur comme Pyzo ou Spyder, vous pouvez simplement exécuter le programme depuis l'éditeur.

Une fois lancé, vous aurez ceci :

$ python 
Bienvenue dans LottoSup™.
Appuyez sur une touche pour jouer!

Appuyez sur une touche pour lancer la machine infernale et retrouver les résultats du premier jour. Vous pourriez avoir, par exemple, ce résultat :

################# Jour 1 #####################

        ======== Status.ACCEPT ========
6) Fac de truc qui sert à rien
        1500 places - rang 2021 : 145300
        votre rang : 655 - rang dernier : 4929
        date limite : jour 5
        Statut : Status.ACCEPT 🎉

        ======== Status.WAIT ========
2) Élite - Youtubeur 
        25 places - rang 2021 : 79
        votre rang : 82 - rang dernier : 27
        date limite : jour 45
        Statut : Status.WAIT ⏳

4) Stanford - Artificial Intelligence
        42 places - rang 2021 : 322
        votre rang : 182 - rang dernier : 64
        date limite : jour 45
        Statut : Status.WAIT ⏳

1) MIT - Aerospace
        60 places - rang 2021 : 296
        votre rang : 259 - rang dernier : 66
        date limite : jour 45
        Statut : Status.WAIT ⏳

7) Éleveur de chèvres dans les Pyrénées
        180 places - rang 2021 : 652
        votre rang : 782 - rang dernier : 191
        date limite : jour 45
        Statut : Status.WAIT ⏳

Acceptation définitive [idx ou 0] :     

On peut voir la liste des formations où vous êtes accepté et celles où vous êtes en liste d'attente. Pour chaque formation, vous avez les informations que Parcoursup est censé vous donner :

  • le nombre de places dans la formation
  • le rang (classement) du dernier qui a été accepté l'année précédente
  • votre rang pour cette formation
  • le rang du dernier à avoir été accepté cette année (si votre rang est inférieur, vous êtes accepté, sinon, vous êtes en liste d'attente)
  • la date limite de réponse pour cette formation avant que l'offre d'acceptation expire (si vous êtes en liste d'attente, cette date correspond au dernier jour de la phase principale, que j'ai fixé à 45)

Ensuite, vous devez faire vos choix. Il faut d'abord répondre si vous voulez accepter définitivement une formation où vous êtes accepté. Rentrez le numéro qui apparaît dans la liste, à gauche du nom de la formation (6 dans l'exemple ci-dessus). Pour ne pas faire de choix, rentrez 0. Si vous acceptez une formation, le jeu est fini.

On vous demande ensuite si vous voulez faire une acceptation temporaire :

Acceptation temporaire [idx ou 0] : 

Si vous faites une acceptation temporaire, vous refusez automatiquement toutes les autres où vous êtes accepté, mais cela n'a pas d'impact sur la liste d'attente.

La question suivante concerne les formations que vous voulez refuser :

Rejets [id1 id2 ... ou 0] : 

Ici, on peut donner une liste d'indices parmi ceux des formations où vous êtes accepté ou en liste d'attente. Elles disparaîtront de la liste le jour suivant. Si vous ne voulez rien refuser, rentrez 0.

Et on passe au jour suivant. Il se peut que rien ne change, mais probablement, le rang du dernier accepté dans les formations va évoluer. Ceci peut vous permettre d'être accepté à d'autres formations.

Quand on arrive au jour 5, c'est la date d'expiration des offres que vous avez reçu jusque là :

################# Jour 5 #####################

        ======== Status.ACCEPT ========
6) Fac de truc qui sert à rien
        1500 places - rang 2021 : 145300
        votre rang : 655 - rang dernier : 42447
        date limite : jour 5
        Statut : Status.ACCEPT 🎉

        ======== Status.WAIT ========
2) Élite - Youtubeur 
        25 places - rang 2021 : 79
        votre rang : 82 - rang dernier : 31
        date limite : jour 45
        Statut : Status.WAIT ⏳

4) Stanford - Artificial Intelligence
        42 places - rang 2021 : 322
        votre rang : 182 - rang dernier : 84
        date limite : jour 45
        Statut : Status.WAIT ⏳

1) MIT - Aerospace
        60 places - rang 2021 : 296
        votre rang : 259 - rang dernier : 91
        date limite : jour 45
        Statut : Status.WAIT ⏳

7) Éleveur de chèvres dans les Pyrénées
        180 places - rang 2021 : 652
        votre rang : 782 - rang dernier : 285
        date limite : jour 45
        Statut : Status.WAIT ⏳

Acceptation définitive [idx ou 0] :   

Ici, si je ne fais pas une acceptation temporaire de la formation no. 6, elle disparaîtra le jour suivant. Si je l'accepte temporairement, sa date limite de réponse est repoussée à la fin de la phase principale de Parcoursup. Voici mes réponses :

Acceptation définitive [idx ou 0] : 0 
Acceptation temporaire [idx ou 0] : 6
Rejets [id1 id2 ... ou 0] : 0

Et donc le jour suivant j'ai ça :

################# Jour 6 #####################

        ======== Status.ACCEPT ========
6) Fac de truc qui sert à rien
        1500 places - rang 2021 : 145300
        votre rang : 655 - rang dernier : 45414
        date limite : jour 45
        Statut : Status.ACCEPT 🎉

        ======== Status.WAIT ========
2) Élite - Youtubeur 
        25 places - rang 2021 : 79
        votre rang : 82 - rang dernier : 31
        date limite : jour 45
        Statut : Status.WAIT ⏳

4) Stanford - Artificial Intelligence
        42 places - rang 2021 : 322
        votre rang : 182 - rang dernier : 84
        date limite : jour 45
        Statut : Status.WAIT ⏳

1) MIT - Aerospace
        60 places - rang 2021 : 296
        votre rang : 259 - rang dernier : 97
        date limite : jour 45
        Statut : Status.WAIT ⏳

7) Éleveur de chèvres dans les Pyrénées
        180 places - rang 2021 : 652
        votre rang : 782 - rang dernier : 307
        date limite : jour 45
        Statut : Status.WAIT ⏳

J'ai donc assuré une formation et je peux attendre de remonter dans la liste d'attente. Il ne me reste que 51 places pour apprendre à devenir Youtubeur.

Quelques détails de mise en œuvre

J'ai seulement utilisé des modules disponibles dans la bibliothèque standard Python. J'avais fait une première version qui utilisait Numpy pour les tirages aléatoires, mais finalement le module math contient tout, y compris le tirage avec une loi exponentielle.

import math
import random
import sys
from enum import Enum, auto

Chaque formation a un état différent (acceptation, liste d'attente, etc.). Je modélise ça avec des énumérations. Ce sont des ensembles de valeurs restreintes :

class Status(Enum):
    ACCEPT = "🎉"  # la formation a accepté
    WAIT = "⏳"
    REJECT = "☠️"  # la formation a rejeté
    DROP = "💪"  # le candidat a rejeté
    EXPIRED = "⌛"  # date limite dépassée
    CHECK = "☑️"  # le candidat a accepté définitivement

Les réponses des candidats aux propositions d'acceptation ou de liste d'attente, sont modélisées de la même façon :

class Choix(Enum):
    WAIT = auto()  # le candidat maintient l'attente
    DROP = auto()  # le candidat a rejeté
    CHECK = auto()  # le candidat a accepté définitvement
    ACCEPT = auto()  # acceptation non définitive

Cela peut sembler redondant, mais les valeurs ne sont pas les mêmes et, surtout, elles n'ont pas la même signification. Par exemple, ACCEPT de la part du candidat veut dire qu'il réserve une place, tandis qu'un ACCEPT de la part d'une formation, ne veut pas dire que le candidat accepte.

Ensuite, je modélise chaque formation par une classe. Le candidat a un rang dans la formation et les acceptations se font en fonction du nombre de places disponibles et le rang du candidat. On gère une liste d'attente qui s'actualise chaque jour en fonction des refus des autres candidats.

class Formation:
    def __init__(self, idx, nom, places, rang_dernier_2021, taux_accept):
        self.nom = nom
        self.idx = idx
        self.total_places = places
        self.rang_dernier_2021 = rang_dernier_2021
        self.rang_dernier = self.total_places
        self.taux_acceptation = taux_accept
        self.votre_rang = self.compute_rang()
        self.date_limite = DUREE_PS  # jour limite pour répondre
        self.places_disponibles = self.total_places
        self.status = None  # statut du candidat dans la formation

La méthode __init__ est le constructeur de la classe. Elle initialise la formation à partir des paramètres suivants :

un indice unique pour chaque formation qui nous permettra de gérer les réponses de l'utilisateur
le nom de la formation (pour l'affichage)
nombre de places disponibles
rang du dernier candidat admis l'année précédente
le taux d'acceptation de la formation

Afin de pouvoir afficher une formation avec print(), nous définissons la méthode =__repr__= :

    def __repr__(self):
        """ Pour un affichage avec print()"""
        return (
            f"{self.idx}) {self.nom}\n"
            f"\t{self.total_places} places -"
            f" rang 2021 : {self.rang_dernier_2021}\n"
            f"\tvotre rang : {self.votre_rang} -"
            f" rang dernier : {self.rang_dernier}\n"
            f"\tdate limite : jour {self.date_limite}\n"
            f"\tStatut : {self.status} {self.status.value}\n"

Plus intéressante est la façon de définir le rang du candidat (son classement) dans la formation. Avec le nombre de places et le taux d'acceptation, on calcule le nombre de candidats. Le rang du candidat parmi tous les autres est calculé à partir du rang dans sa classe par une simple règle de trois. Pour donner un peu de réalisme, on fait un tirage aléatoire avec une loi gaussienne de moyenne égale à ce rang et un écart type de 10 (parce que!). La gaussienne étant à support non borné, on coupe entre 0 et le nombre de candidats.

    def compute_rang(self):
        nombre_candidats = self.total_places / self.taux_acceptation
        rang_moyen = RANG_CLASSE / TAILLE_CLASSE * nombre_candidats
        tirage = random.gauss(rang_moyen, 10)
        rang = int(min(nombre_candidats, max(tirage, 0)))
        return rang

Cette méthode est appelée par le constructeur lors de l'initialisation de la formation.

À chaque fois que le candidat reçoit une offre de formation, elle est accompagnée d'une date limite de réponse :

    def update_date_limite(self, jour):
        if jour < 5:
            self.date_limite = 5
            self.date_limite = jour + 1

Pour simuler l'évolution de la liste d'attente chaque jour, on calcule simplement le rang du dernier accepté dans la formation. Cela nous évite de devoir simuler les refus de candidats (c'est des adolescents, ils sont imprévisibles).

La position du dernier admis avance avec selon une densité de probabilité exponentielle (il est plus probable que la position du dernier avance de peu de places que de beaucoup) :

\[p(x) = \lambda e^{-\lambda x}\]

La moyenne de cette loi (λ), décroît avec les jours qui passent (davantage de candidats se désistent les premiers jours que par la suite).

Pour fixer λ, on part du 99è quantile (la valeur de la variable pour laquelle on a 99% de chances d'être en dessous). Pour la distribution exponentielle, la fonction quantile est : \[q = -\ln(1-p)/\lambda\]

De façon arbitraire, on prend q égal à un dixième de la longueur de la liste d'attente.

    def update_rang_dernier(self, jour):
        longueur_attente = max(
            self.rang_dernier_2021 - self.rang_dernier + 1, 1
        jours_restants = DUREE_PS - jour + 1
        q = longueur_attente / 10
        p = min(
            0.999, float(jours_restants / DUREE_PS) * 0.99
        )  # on fixe le max à 0.999 pour éviter des exceptions dans le log
        lam = max(-(math.log(1 - p) / q), 1e-15)
        tirage = random.expovariate(lam)
        self.rang_dernier += int(tirage)

Le statut du candidat dans la formation est actualisé chaque jour. On utilise la méthode suivante dans 2 cas :

  1. la mise à jour automatique du système chaque nuit,
  2. la mise à jour après choix du candidat.

La mise à jour du lotto Parcoursup est faite de la façon suivante. Au départ, un candidat est accepté par une formation si son rang est inférieur ou égal au nombre de places de la formation. Les jours suivants, il est accepté si son rang est inférieur au dernier admis cette année. Un candidat est refusé par une formation si son rang es de 20% supérieur à celui du dernier admis l'année dernière (je ne pense pas que ce soit la règle utilisée par les formations, mais j'aime bien être sévère avec ces jeunes …). On gère aussi l'expiration des délais d'attente de réponse. Si le délai est dépassé, l'offre expire.

La gestion des choix du candidat consiste à mettre à jour le statut de la formation en fonction de sa réponse.

    def update_status(self, jour, choix: Choix = None):
        if choix is None:
            if self.votre_rang <= self.rang_dernier and (
                self.status == Status.WAIT or self.status == Status.ACCEPT
                if self.status != Status.ACCEPT:
                    self.status = Status.ACCEPT
            elif self.votre_rang >= self.rang_dernier_2021 * 1.2:
                self.status = Status.REJECT
                self.status = Status.WAIT
            if self.status == Status.ACCEPT and jour > self.date_limite:
                self.status = Status.EXPIRED
        elif choix == Choix.CHECK:
            self.status = Status.CHECK
        elif choix == Choix.DROP:
            self.status = Status.DROP
        elif choix == Choix.WAIT:
            self.status = Status.WAIT

Et c'est tout pour les formations.

On passe ensuite à la simulation des actualisation journalières du système. On crée aussi une classe pour cela.

class Parcoursup:
    def __init__(self, voeux):
        self.voeux = voeux
        self.jour = 0
        for v in self.voeux:

La classe est construite avec les vœux du candidat le jour 0. Pour chaque vœux, on fait une mise à jour définies par les formations.

Ensuite, on a une méthode pour l'itération journalière du système. C'est simple :

  • on commence par mettre à jour en chaque formation pour le jour courant,
  • on affiche les résultats de l'algo PS®,
  • on demande au candidat de faire ses choix et
  • on élimine les formations refusées par le candidat.
    def iteration(self):

        self.jour += 1
        for v in self.voeux:

Pour faire tout ça, on aura besoin de quelques petites méthodes. D'abord, il nous faut pouvoir récupérer un voeu à partir de son identifiant unique :

    def get_voeu(self, idx):
        for v in self.voeux:
            if v.idx == idx:
                return v

Si le candidat accepte temporairement une formation, il refuse automatiquement toutes les autres où il a été accepté. That's life. Bienvenue au monde des adultes …

    def drop_all_accept_except(self, accept_tmp):
        for v in self.voeux:
            if v.idx != accept_tmp and v.status == Status.ACCEPT:
                v.update_status(self.jour, Choix.DROP)

Une petite fonction pour poser une question à l'utilisateur et récupérer sa réponse :

    def get_input(self, message, single_value=False):
        print(message, end=" ")
        resp = input()
        if single_value:
                val = int(resp)
                return val
            except ValueError:
                self.get_input(message, True)
            return resp.split()

Ici, on fait l'interaction avec l'utilisateur pour gérer les choix. On commence par lui demander s'il veut accepter définitivement une formation, on passe ensuite à l'acceptation temporaire et on finit par récupérer les formations refusées. En cas d'acceptation (définitive ou temporaire) on vérifie qu'il choisit des formations pour lesquelles il a été accepté! Si vous avez des ados à la maison, vous comprendrez que c'est une bonne précaution à prendre …

    def choice(self):
        accept_def = self.get_input(
            "Acceptation définitive [idx ou 0] :", True
        while accept_def != 0:
            v_accept = self.get_voeu(accept_def)
            if v_accept.status == Status.ACCEPT:
                v_accept.update_status(self.jour, Choix.CHECK)
                    "Félicitations, vous avez accepté définitivement"
                    " la formation\n"
                    "Vous n'avez pas été accepté dans la formation\n"
                accept_def = self.get_input(
                    "Acceptation définitive [idx ou 0] :", True

        accept_tmp = self.get_input(
            "Acceptation temporaire [idx ou 0] :", True
        while accept_tmp != 0:
            v_accept = self.get_voeu(accept_tmp)
            if v_accept.status == Status.ACCEPT:
                v_accept.date_limite = DUREE_PS
                    "Vous n'avez pas été accepté dans la formation\n"
                accept_tmp = self.get_input(
                    "Acceptation temporaire [idx ou 0] :", True
        rejets = self.get_input("Rejets [id1 id2 ... ou 0] :", False)
        if rejets[0] == "0":
            return None
        for rejet in rejets:
            v_rejet = self.get_voeu(int(rejet))
            if v_rejet is not None:
                v_rejet.update_status(self.jour, Choix.DROP)

Il y a quelques doublons dans le code qui auraient pu être factorisés dans une autre méthode, mais ça fait l'affaire.

Voici la méthode pour éliminer les formations qui ne sont ni acceptées ni en liste d'attente :

    def clean(self):
        self.voeux = {
            for v in self.voeux
            if (v.status == Status.ACCEPT or v.status == Status.WAIT)

Et finalement, une fonction pour afficher les résultats du lotto journalier. Seulement les formations où le candidat est accepté ou en liste d'attente sont affichées. Il ne faut pas remuer le couteau dans la plaie!

    def print(self):
        print(f"################# Jour {self.jour} #####################\n")
        for s in [Status.ACCEPT, Status.WAIT]:
            print(f"\t======== {s} ========")
            for v in self.voeux:
                if v.status == s:

Nous avons terminé avec la classe Parcoursup.

La fonction principale pour faire tourner la simulation. On lit les veux du candidat et on initialise la simulation. On simule (DUREE_PS = 45 par défaut) jours d'agonie et de stress.

def main():
    voeux = {
        Formation(idx, n, p, r, tr)
        for idx, (n, p, r, tr) in enumerate(VOEUX, start=1)
    ps = Parcoursup(voeux)
    print("Bienvenue dans LottoSup™.\n" "Appuyez sur une touche pour jouer!\n")
    for it in range(DUREE_PS):

Le code est relativement simple et court, ce qui permet d'en changer le comportement relativement facilement.

Tags: fr programming python education politics
27 May 2015

Installing OTB has never been so easy

You've heard about the Orfeo Toolbox library and its wonders, but urban legends say that it is difficult to install. Don't believe that. Maybe it was difficult to install, but this is not the case anymore.

Thanks to the heroic work of the OTB core development team, installing OTB has never been so easy. In this post, you will find the step-by-step procedure to compile OTB from source on a Debian 8.0 Jessie GNU/Linux distribution.

Prepare the user account

I assume that you have a fresh install. The procedure below has been tested in a virtual machine running under VirtualBox. The virtual machine was installed from scratch using the official netinst ISO image.

During the installation, I created a user named otb that I will use for the tutorial below. For simplicity, I give this user root privileges in order to install some packages. This can be done as follows. Log in as root or use the command:

su -

You can then edit the /etc/sudoers file by using the following command:


This will open the file with the nano text editor. Scroll down to the lines containing

# User privilege specification
root    ALL=(ALL:ALL) ALL

and copy the second line and below and replace root by otb:

otb     ALL=(ALL:ALL) ALL

Write the file and quit by doing C^o ENTER C^x. Log out and log in as otb. You are set!

System dependencies

Now, let's install some packages needed to compile OTB. Open a terminal and use aptitude to install what we need:

sudo aptitude install mercurial \
     cmake-curses-gui build-essential \
     qt4-dev-tools libqt4-core \
     libqt4-dev libboost1.55-dev \
     zlib1g-dev libopencv-dev curl \
     libcurl4-openssl-dev swig \

Get OTB source code

We will install OTB in its own directory. So from your $HOME directory create a directory named OTB and go into it:

mkdir OTB
cd OTB

Now, get the OTB sources by cloning the repository (depending on your network speed, this may take several minutes):

hg clone

This will create a directory named OTB (so in my case, this is /home/otb/OTB/OTB).

Using mercurial commands, you can choose a particular version or you can go bleeding edge. You will at least need the first release candidate for OTB-5.0, which you can get with the following commands:

cd OTB
hg update 5.0.0-rc1
cd ../

Get OTB dependencies

OTB's SuperBuild is a procedure which deals with all external libraries needed by OTB which may not be available through your Linux package manager. It is able to download source code, configure and install many external libraries automatically.

Since the download process may fail due to servers which are not maintained by the OTB team, a big tarball has been prepared for you. From the $HOME/OTB directory, do the following:

tar xvjf SuperBuild-archives.tar.bz2

The download step can be looooong. Be patient. Go jogging or something.

Compile OTB

Once you have downloaded and extracted the external dependencies, you can start compiling OTB. From the $HOME/OTB directory, create the directory where OTB will be built:

mkdir -p SuperBuild/OTB

At the end of the compilation, the $HOME/OTB/SuperBuild/ directory will contain a classical bin/, lib/, include/ and share/ directory tree. The $HOME/OTB/SuperBuild/OTB/ is where the configuration and compilation of OTB and all the dependencies will be stored.

Go into this directory:

cd SuperBuild/OTB

Now we can configure OTB using the cmake tool. Since you are on a recent GNU/Linux distribution, you can tell the compiler to use the most recent C++ standard, which can give you some benefits even if OTB still does not use it. We will also compile using the Release option (optimisations). The Python wrapping will be useful with the OTB Applications. We also tell cmake where the external dependencies are. The options chosen below for OpenJPEG make OTB use the gdal implementation.

cmake \
    -DCMAKE_CXX_FLAGS:STRING=-std=c++14 \
    -DCMAKE_INSTALL_PREFIX:PATH=/home/otb/OTB/SuperBuild/ \
    -DDOWNLOAD_LOCATION:PATH=/home/otb/OTB/SuperBuild-archives/ \

After the configuration, you should be able to compile. I have 4 cores in my machine, so I use the -j4 option for make. Adjust the value to your configuration:

make -j4

This will take some time since there are many libraries which are going to be built. Time for a marathon.

Test your installation

Everything should be compiled and available now. You can set up some environment variables for an easier use of OTB. You can for instance add the following lines at the end of $HOME/.bashrc:

export OTB_HOME=${HOME}/OTB/SuperBuild
export PATH=${OTB_HOME}/bin:$PATH

You can now open a new terminal for this to take effect or use:

source .bashrc 

You should now be able to use the OTB Applications. For instance, the command:


should display the documentation for the BandMath application.

Another way to run the applications, is using the command line application launcher as follows:

otbApplicationLauncherCommandLine BandMath $OTB_HOME/lib/otb/applications/


The SuperBuild procedure allows to easily install OTB without having to deal with different combinations of versions for the external dependencies (TIFF, GEOTIFF, OSSIM, GDAL, ITK, etc.).

This means that once you have cmake and a compiler, you are pretty much set. QT4 and Python are optional things which will be useful for the applications, but they are not required for a base OTB installation.

I am very grateful to the OTB core development team (Julien, Manuel, Guillaume, the other Julien, Mickaël, and maybe others that I forget) for their efforts in the work done for the modularisation and the development of the SuperBuild procedure. This is the kind of thing which is not easy to see from the outside, but makes OTB go forward steadily and makes it a very mature and powerful software.

Tags: remote-sensing otb programming open-source free-software tricks
13 May 2015

A simple thread pool to run parallel PROSail simulations

In the otb-bv we use the OTB versions of the Prospect and Sail models to perform satellite reflectance simulations of vegetation.

The code for the simulation of a single sample uses the ProSail simulator configured with the satellite relative spectral responses, the acquisition parameters (angles) and the biophysical variables (leaf pigments, LAI, etc.):

ProSailType prosail;
auto result = prosail();

A simulation is computationally expensive and it would be difficult to parallelize the code. However, if many simulations are going to be run, it is worth using all the available CPU cores in the machine.

I show below how using C++11 standard support for threads allows to easily run many simulations in parallel.

Each simulation uses a set of variables given in an input file. We parse the sample file in order to get the input parameters for each sample and we construct a vector of simulations with the appropriate size to store the results.

otbAppLogINFO("Processing simulations ..." << std::endl);
auto bv_vec = parse_bv_sample_file(m_SampleFile);
auto sampleCount = bv_vec.size();
otbAppLogINFO("" << sampleCount << " samples read."<< std::endl);
std::vector<SimulationType> simus{sampleCount};

The simulation function is actually a lambda which will sequentially process a sequence of samples and store the results into the simus vector. We capture by reference the parameters which are the same for all simulations (the satellite relative spectral responses satRSR and the acquisition angles in prosailPars):

auto simulator = [&](std::vector<BVType>::const_iterator sample_first,
                     std::vector<BVType>::const_iterator sample_last,
                     std::vector<SimulationType>::iterator simu_first){
  ProSailType prosail;
  while(sample_first != sample_last)
    *simu_first = prosail();

We start by figuring out how to split the simulation into concurrent threads. How many cores are there?

auto num_threads = std::thread::hardware_concurrency();
otbAppLogINFO("" << num_threads << " CPUs available."<< std::endl);

So we define the size of the chunks we are going to run in parallel and we prepare a vector to store the threads:

auto block_size = sampleCount/num_threads;
if(num_threads>=sampleCount) block_size = sampleCount;
std::vector<std::thread> threads(num_threads);

Here, I choose to use as many threads as cores available, but this could be changed by a multiplicative factor if we know, for instance that disk I/O will introduce some idle time for each thread.

An now we can fill the vector with the threads that will process every block of simulations :

auto input_start = std::begin(bv_vec);
auto output_start = std::begin(simus);
for(size_t t=0; t<num_threads; ++t)
  auto input_end = input_start;
  std::advance(input_end, block_size);
  threads[t] = std::thread(simulator,
  input_start = input_end;
  std::advance(output_start, block_size);

The std::thread takes the name of the function object to be called, followed by the arguments of this function, which in our case are the iterators to the beginning and the end of the block of samples to be processed and the iterator of the output vector where the results will be stored. We use std::advance to update the iterator positions.

After that, we have a vector of threads which have been started concurrently. In order to make sure that they have finished before trying to write the results to disk, we call join on each thread, which results in waiting for each thread to end:

otbAppLogINFO("" << sampleCount << " samples processed."<< std::endl);
for(const auto& s : simus)

This may no be the most efficient solution, nor the most general one. Using std::async and std::future would have allowed not to have to deal with block sizes, but in this solution we can easily decide the number of parallel threads that we want to use, which may be useful in a server shared with other users.

Tags: remote-sensing otb research programming open-source cpp algorithms
25 Mar 2015

Scripting in C++


This quote of B. Klemens1 that I used in my previous post made me think a little bit:

"I spent much of my life ignoring the fundamentals of computing and just hacking together projects using the package or language of the month: C++, Mathematica, Octave, Perl, Python, Java, Scheme, S-PLUS, Stata, R, and probably a few others that I’ve forgotten."

Although much less skilled than M. Klemens, I have also wandered around the programming language landscape, out of curiosity, but also with particular needs in mind.

For example, I listed a set of features that I found useful for easily going from prototyping to operational code for remote sensing image processing. The main 3 bullet points were:

  1. OTB accessibility, which meant C++, Python or Java (and its derivatives as Clojure);
  2. Robust concurrent programming, which ruled out Python;
  3. Having a REPL, which led me to say silly things and even develop working code.

Before going the OTB-from-LISP-through-bindings route, I had looked for C++ interpreters to come close to a REPL, some are available, but, at least at the time, I didn't manage to get one running.

Since I started using C++11 a little bit, I find more and more pleasant to write C++ even for small things that I would usually use Python for. Add to that all the new things in the standard library, like threads, regular expressions, tuples and proper random number generators, just to name a few, and here I am wondering how to make the edit/compile/run loop shorter.

B. Klemens proposes crazy things like compiling via cut/paste, but this is too extreme even for me. Since I use emacs, I can compile from within the editor just by hitting C-c C-c which asks for a command to compile. I usually do things like:

export ITK_AUTOLOAD_PATH="" && cd ~/Dev/builds/otb-bv/ && make \
    && ctest -I 8,8 -VV && gnuplot ~/Dev/otb-bv/data/plot_bvMultiT.gpl

I am nearly there, but not yet.

Remember that I discovered Klemens because I was looking for statistical libs after watching some videos of the Coursera Data Science specialization? One of the courses is about reproducible research and they show how you can mix together R code and Markdown to produce executable research papers.

When I saw this R Markdown approach, I thought, well, that's nice, but inferior to org-babel. The main advantage of org-babel with respect to R Markdown or IPython is that in org-mode you can use different languages in the same document.

In a recent research report written in org-mode and exported to PDF, I used, in the same document, Python, bash, gnuplot and maxima. This is a real use case. Of course, these are all interpreted languages, so it is easy to embed them in documents and talk to the interpreter.

Well, 15 years after I started using emacs and more than 6 years after discovering org-mode, I found that compiled languages can be used with org-babel. At least, decent compiled languages, since there is no mention of Objective-C or Java2.

Setting up and testing

So how one does use C++ with org-babel? Just drop this into your emacs configuration:

Yes, don't forget the flag for C++11, since you want to be cool.

Now, let the fun begin. Open an org buffer and write:

#+begin_src C++ :includes <iostream>
std::cout << "Hello scripting world\n";

Go inside the code block and hit C-c C-c and you will magically see this appear just below the code block:

: Hello scripting world

No problem. You are welcome. Good-bye.

If you want to go beyond the Hello world example keep reading.

Using the standard library

You have noticed that the first line of the code block says it is C++ and gives information about the header files to include. If you need to include more than one header file3, just make a list of them (remember, emacs is a LISP machine):

#+begin_src C++ :includes (list "<iostream>" "<vector>")
std::vector<int> v{1,2,3};
for(auto& val : v) std::cout << val << "\n";

| 1 |
| 2 |
| 3 |

Do you see the nice formatted results? This is an org-mode table. More fun with the standard library and tables:

#+begin_src C++ :includes (list "<iostream>" "<string>" "<map>")
std::map<std::string, int> m{{"pizza",1},{"muffin",2},{"cake",3}};
for(auto& val : m) std::cout << val.first << " " << val.second << "\n";

| cake   | 3 |
| muffin | 2 |
| pizza  | 1 |

Using other libraries

Sometimes, you may want to use other libraries than the C++ standard one4. Just give the header files as above and also the flags (here I split the header of the org-babel block in 2 lines for readability):

#+header: :includes (list "<gsl/gsl_fit.h>" "<stdio.h>") 
#+begin_src C++ :exports both :flags (list "-lgsl" "-lgslcblas")
int i, n = 4;
double x[4] = { 1970, 1980, 1990, 2000 };
double y[4] = {   12,   11,   14,   13 };
double w[4] = {  0.1,  0.2,  0.3,  0.4 };

double c0, c1, cov00, cov01, cov11, chisq;

gsl_fit_wlinear (x, 1, w, 1, y, 1, n, 
                 &c0, &c1, &cov00, &cov01, &cov11, 

printf ("# best fit: Y = %g + %g X\n", c0, c1);
printf ("# covariance matrix:\n");
printf ("# [ %g, %g\n#   %g, %g]\n", 
        cov00, cov01, cov01, cov11);
printf ("# chisq = %g\n", chisq);

for (i = 0; i < n; i++)
  printf ("data: %g %g %g\n", 
          x[i], y[i], 1/sqrt(w[i]));

printf ("\n");

for (i = -30; i < 130; i++)
  double xf = x[0] + (i/100.0) * (x[n-1] - x[0]);
  double yf, yf_err;

  gsl_fit_linear_est (xf, 
                      c0, c1, 
                      cov00, cov01, cov11, 
                      &yf, &yf_err);

  printf ("fit: %g %g\n", xf, yf);
  printf ("hi : %g %g\n", xf, yf + yf_err);
  printf ("lo : %g %g\n", xf, yf - yf_err);

| #     | best       |    fit: |       Y | = | -106.6 | + | 0.06 | X |
| #     | covariance | matrix: |         |   |        |   |      |   |
| #     | [          |  39602, |   -19.9 |   |        |   |      |   |
| #     | -19.9,     |   0.01] |         |   |        |   |      |   |
| #     | chisq      |       = |     0.8 |   |        |   |      |   |


Defining things outside main

You may have noticed that we are really scripting like perl hackers, since there is no main() function in our code snippets. org-babel kindly generates all the boilerplate for us. Thank you.

But wait, what happens if I want to define or declare things outside main?

No problem, just tell org-babel not to generate a main():

#+begin_src C++ :includes <iostream> :main no
struct myStr{int i;};

int main()
myStr ms{2};
std::cout << "The value of the member is " << ms.i << std::endl;

: The value of the member is 2

There are many other options you can play with and I suggest to read the documentation. Just 2 more useful things.

Using org-mode tables as data

You saw that org-mode has tables. They are very powerful and can even be used as a spreadsheet. Let's see how they can be used as input to C++ code.

Let's start by giving a name to our table:

#+tblname: somedata
|  sqr | noise |
|  0.0 |  0.23 |
|  1.0 |  1.31 |
|  4.0 |  4.61 |
|  9.0 |  9.05 |
| 16.0 | 16.55 |

Now, we can refer to it from org-babel using variables:

#+header: :exports results :includes <iostream>
#+begin_src C++ :var somedata=somedata :var somedata_cols=2 :var somedata_rows=5
for (int i=0; i<somedata_rows; i++)
  for (int j=0; j<somedata_cols; j++) 
    std::cout << somedata[i][j]/2.0 << "\t";
  std::cout << "\n";

|   0 | 0.115 |
| 0.5 | 0.655 |
|   2 | 2.305 |
| 4.5 | 4.525 |
|   8 | 8.275 |

Tangle when happy

Once you have scripted your solution to save the world from hunger5, you may want to get it out from org-mode and upload it to github6. You can of course copy and paste the code, but since you are using a superior editor, you can ask it to do that7 for you. Just tell it the name of your file using the tangle option:

#+begin_src C++ :tangle MyApp.cpp :main no :includes <iostream>
namespace ns {
    class MyClass {};

int main(int argc, char* argv[])
std::cout << "hi" << std::endl;
return 0;

If you the call org-babel-tangle from this block, it will generate a file with the right name containing this:

#include <iostream>
namespace ns {
    class MyClass {};

int main(int argc, char* argv[])
std::cout << "hi" << std::endl;
return 0;


Since reproducible research in emacs has entered the realm of scientific journals8, I will end as follows:

In this article, we have shown the feasibility of coding in C++ without writing much boilerplate and going through the edit/compile/run loop. With the advent of modern C++ standards and richer standard libraries, C++ is a good candidate to quickly write throw-away code. The integration of C++ in org-babel gets C++ even closer to the tasks for which scripting languages are used for.

Future work will include learning to use the C++ standard library (data structures and algorithms), browsing the catalog of BOOST libraries, trying to get good C++ developers to use emacs and C++11, and find a generic variadic-template-based solution to world hunger.



Ben Klemens, Modeling with Data: Tools and Techniques for Scientific Computing, Princeton University Press, 2009, ISBN: 9780691133140.


Well, this is just trolling, since Java is in the list of supported languages, and I guess that you can use gcc for Objective-C.


This may happen if you are writing code for a nuclear power plant.


To guide nuclear missiles towards the nuclear power plant of the evil guys, for instance.


Just send some nuclear bombs to those who are hungry?


Isn't that cool, to be on the same cloud as all those Javascript programmers?


But tell it to make coffee first.


Eric Schulte, Dan Davison, Tom Dye, Carsten Dominik. A Multi-Language Computing Environment for Literate Programming and Reproducible Research Journal of Statistical Software

Tags: programming cpp linux emacs org-mode
18 Mar 2015

Data science in C?

Coursera is offering a Data Science specialization taught by professors from Johns Hopkins. I discovered it via one of their courses which is about reproducible research. I have been watching some of the video lectures and they are very interesting, since they combine data processing, programming and statistics.

The courses use the R language which is free software and is one the reference tools in the field of statistics, but it is also very much used for other kinds of data analysis.

While I was watching some of the lectures, I had some ideas to be implemented in some of my tools. Although I have used GSL and VXL1 for linear algebra and optimization, I have never really needed statistics libraries in C or C++, so I ducked a bit and found the apophenia library2, which builds on top of GSL, SQLite and Glib to provide very useful tools and data structures to do statistics in C.

Browsing a little bit more, I found that the author of apophenia has written a book "Modeling with data"3, which teaches you statistics like many books about R, but using C, SQLite and gnuplot.

This is the kind of technical book that I like most: good math and code, no fluff, just stuff!

The author sets the stage from the foreword. An example from page xii (the emphasis is mine):

" The politics of software

All of the software in this book is free software, meaning that it may be freely downloaded and distributed. This is because the book focuses on portability and replicability, and if you need to purchase a license every time you switch computers, then the code is not portable. If you redistribute a functioning program that you wrote based on the GSL or Apophenia, then you need to redistribute both the compiled final program and the source code you used to write the program. If you are publishing an academic work, you should be doing this anyway. If you are in a situation where you will distribute only the output of an analysis, there are no obligations at all. This book is also reliant on POSIX-compliant systems, because such systems were built from the ground up for writing and running replicable and portable projects. This does not exclude any current operating system (OS): current members of the Microsoft Windows family of OSes claim POSIX compliance, as do all OSes ending in X (Mac OS X, Linux, UNIX,…)."

Of course, the author knows the usual complaints about programming in C (or C++ for that matter) and spends many pages explaining his choice:

"I spent much of my life ignoring the fundamentals of computing and just hacking together projects using the package or language of the month: C++, Mathematica, Octave, Perl, Python, Java, Scheme, S-PLUS, Stata, R, and probably a few others that I’ve forgotten. Albee (1960, p 30)4 explains that “sometimes it’s necessary to go a long distance out of the way in order to come back a short distance correctly;” this is the distance I’ve gone to arrive at writing a book on data-oriented computing using a general and basic computing language. For the purpose of modeling with data, I have found C to be an easier and more pleasant language than the purpose-built alternatives—especially after I worked out that I could ignore much of the advice from books written in the 1980s and apply the techniques I learned from the scripting languages."

The author explains that C is a very simple language:

" Simplicity

C is a super-simple language. Its syntax has no special tricks for polymorphic operators, abstract classes, virtual inheritance, lexical scoping, lambda expressions, or other such arcana, meaning that you have less to learn. Those features are certainly helpful in their place, but without them C has already proven to be sufficient for writing some impressive programs, like the Mac and Linux operating systems and most of the stats packages listed above."

And he makes it really simple, since he actually teaches you C in one chapter of 50 pages (and 50 pages counting source code is not that much!). He does not teach you all C, though:

"As for the syntax of C, this chapter will cover only a subset. C has 32 keywords and this book will only use 18 of them."

At one point in the introduction I worried about the author bashing C++, which I like very much, but he actually gives a good explanation of the differences between C and C++ (emphasis and footnotes are mine):

"This is the appropriate time to answer a common intro-to-C question: What is the difference between C and C++? There is much confusion due to the almost-compatible syntax and similar name—when explaining the name C-double-plus5, the language’s author references the Newspeak language used in George Orwell’s 1984 (Orwell, 19496; Stroustrup, 1986, p 47). The key difference is that C++ adds a second scope paradigm on top of C’s file- and function-based scope: object-oriented scope. In this system, functions are bound to objects, where an object is effectively a struct holding several variables and functions. Variables that are private to the object are in scope only for functions bound to the object, while those that are public are in scope whenever the object itself is in scope. In C, think of one file as an object: all variables declared inside the file are private, and all those declared in a header file are public. Only those functions that have a declaration in the header file can be called outside of the file. But the real difference between C and C++ is in philosophy: C++ is intended to allow for the mixing of various styles of programming, of which object-oriented coding is one. C++ therefore includes a number of other features, such as yet another type of scope called namespaces, templates and other tools for representing more abstract structures, and a large standard library of templates. Thus, C represents a philosophy of keeping the language as simple and unchanging as possible, even if it means passing up on useful additions; C++ represents an all-inclusive philosophy, choosing additional features and conveniences over parsimony."

It is actually funny that I find myself using less and less class inheritance and leaning towards small functions (often templates) and when I use classes, it is usually to create functors. This is certainly due to the influence of the Algorithmic journeys of Alex Stepanov.



Actually the numerics module, vnl.


Apophenia is the experience of perceiving patterns or connections in random or meaningless data.


Ben Klemens, Modeling with Data: Tools and Techniques for Scientific Computing, Princeton University Press, 2009, ISBN: 9780691133140.


Albee, Edward. 1960. The American Dream and Zoo Story. Signet.


Orwell, George. 1949. 1984. Secker and Warburg.


Stroustrup, Bjarne. 1986. The C++ Programming Language. Addison-Wesley.

Tags: otb research programming books open-source free-software c cpp linux algorithms
04 Feb 2015

L'autre jour je discutais avec mon ami Gilles sur la frustration ressentie face à des services clients qui n'en sont pas. Que ce soient des avatars appelés conseillers virtuels ou des humains qui suivent une procédure préétablie pour essayer de vous donner une réponse à une question posée, il me disait que c'est pareil. Et qu'en fait, il vaudrait mieux qu'il n'y ait pas d'humain dans le système, car ça doit être désagréable pour eux de recevoir des remarques méprisantes voire des insultes de la part des clients.

Notre conversation a dérivé vers l'idée générale d'automatisation du travail pour enfin arriver au constat que les possibilités ne se limitent pas à des tâches répétitives pour lesquelles des procédures explicites peuvent être établies. Ce qui, il y a quelques années, était encore considéré comme de la science fiction ou des utilisations de l'intelligence artificielle qui restaient au niveau de la démonstration, est en train de devenir presque banal.

Ceux qui utilisent les services de Google, Twitter ou Facebook sont maintenant habitués à la reconnaissance faciale sur les photos ou l'interprétation automatique du langage naturel pour proposer des publicités ciblées.

Plus insidieux encore est le filtrage des messages reçus (priority inbox dans Gmail ou sur les flux de messages Facebook ou Twitter) qui ne présentent que ce qui est censé les intéresser en fonction d'un ensemble de critères qui leur conviennent mais qu'ils n'ont pas choisi explicitement. Cela a rappelé Gilles des idées qu'il avait eues il y a quelques années quand il participait à des commissions de choix où siégeaient des experts. Je ne dévoilerai de quoi s'agissait-il dans ces commissions, car ce qui vient pourrait être considéré comme un manque de professionnalisme de la part des experts mais aussi de Gilles.

Je n'ai pas les éléments pour défendre les experts, mais à la décharge de mon ami Gilles, je dirais qu'il était jeune et qu'il ne se rendait pas compte des enjeux de haute importance qui y étaient traités. Dans ces commissions il s'agissait d'évaluer des dossiers. Beaucoup de dossiers en peu de temps. Souvent, le panel d'experts était composé de vrais experts, très bons, mais pas du domaine concerné. Apparemment, ceci n'était pas considéré comme un problème par les méta-experts qui organisaient les commissions.

Malheureusement, cette situation transformait les évaluations en une sorte de mélange de loterie combinée à l'inquisition où chaque expert essayait de s'en sortir sans être trop ridicule.

Au bout que quelques commissions, Gilles croyait être capable d'anticiper les réactions de la plupart des experts (ils étaient des experts professionnels et donc souvent les mêmes). La quantité de dossiers à évaluer poussait les experts à faire très vite et malheureusement à bâcler le travail. La plupart fonctionnaient en pilotage automatique. Ceci désespérait Gilles. J'ai d'autres amis qui auraient eu envie de frapper certains des experts, mais l'ami dont il est question ici n'est pas un garçon violent. Il est plutôt passionné d'informatique et donc sa frustration vis-à-vis des experts s'est traduite par une envie de les remplacer par des algorithmes. En séance, il regardait les experts et voyait des logigrammes.

La frustration augmentant, ces logigrammes sont devenus du pseudo-code. Et de là, il n'a pas pu s'empêcher de franchir le pas et de coder chaque expert en séance. Il a fait de l'eXtreme programming en temps réel. Il est comme ça, mon ami : ce sont les méthodes à Gilles.

Certains des experts étaient parfaitement clairs ordonnés et élégants. Ils les remplaçait par un script Python. D'autres faisaient beaucoup attention à la forme et peu au fond des dossiers, ils devenaient des classes Ruby.

Certains, on le voyait, avaient été comme cela, mais appartenaient à une autre époque. Ils méritaient du Java. D'autres étaient, malgré tout efficaces, mais parfois difficiles à suivre. Il leur accordait du C++. D'autres étaient rapidement débordés dès qu'ils avaient plusieurs dossiers à traiter. Ils finissaient en script Matlab.

D'autres, avaient une telle mémoire des dossiers qu'ils avaient traités pendant des années, qu'il fallait les remplacer par du SQL. D'autres étaient incompréhensibles, désordonnés, embrouillés. Ils méritaient à peine un peu de Perl, voir de l'awk et il avait envie de les envoyer vers /dev/null.

Il était déçu de ne jamais avoir pu se servir du Lisp, qu'il réservait pour le jour où il croiserait un expert qui se conduirait de façon élégante, intelligente et iconoclaste, mais juste.

Gilles a beaucoup mûri professionnellement et a fait reconnaître ses compétences. D'ingénieur junior il est passé architecte, puis chef de projet. Il a récemment atteint la sublimation et a été nommé responsable technico-commercial.

Tags: fr humor programming rant
19 Aug 2012

Emacs Lisp as a general purpose language?

I live in Emacs

As you may already know if you read this blog (yes, you, the only one who reads this blog …) I live inside Emacs. I have been an Emacs user for more than 12 years now and I do nearly everything with it:

  • org-mode for organization and document writing (thanks to the export functionalities),
  • gnus for e-mail (I used to use VM before),
  • ERC and identica-mode for instant messaging and micro-blogging (things that I don't do much anymore),
  • this very post is written in Emacs and automatically uploaded to my blog using the amazing org2blog package,
  • dired as a file manager,
  • and of course all my programming tasks.

This has been a progressive shift from just using Emacs as a text editor to using it nearly as an operating system. An operating system with a very decent editor, despite of what some may claim!. It is not a matter of totalitarism or aesthetics, but a matter of efficiency: the keyboard is the most efficient user input method for me, so being able to use the same keyboard shortcuts for programming, writing e-mails, etc. is very useful.  


Looking for a Lisp

In a previous post I wrote about my thoughts on choosing a programming language for a particular project at work. My main criteria for the choice were:

  • availability of a REPL;
  • easy OTB access;
  • integrated concurrent programming constructs.

I finally chose Clojure, a language that had been on my radar for nearly a year by then. I had already started to learn the language and made some tests to access OTB Java bindings and I was satisfied with it. Since sometimes life happens, and in between there is real work too, I haven't really had the time to produce lots of code with it, and by now, I have only used it for things I would had done in Python before. That is, some glue code to access OTB, a PostGIS data base, and things like that. I also feel some kind of psychological barrier to commit to Clojure as a main language (more on this below). Added to that, has also been the recent evolution of the OTB application framework, which is very easy to use from the command line. In the last days of my vacation time, the thoughts about the choice of Clojure came back to me. I think that everything was triggered by a conversation with Tisham Dhar about his recent use of Scheme. It happened during IGARSS 2012 in Munich, and despite (or maybe thanks to?) the Bavarian beer I was analyzing pros and cons again. Actually, before starting to look at Clojure, when I was re-training myself about production systems, I had thought about GNU Guile (a particular implementation of Scheme) as an extension language for OTB applications, but finally gave up, since I thought that going the other way around was better (having the rule-based system around the image processing bits). When I thought that I was again settled with my choice of Clojure, I stumbled upon the announcement of elnode's latest version. Elnode is a web server written in Emacs Lisp. I then spent some hours reading Nic Ferrier's blog (the author of Elnode) and I saw that he advocates for the use of Emacs Lisp as a general purpose programming language. Elnode itself is a demonstration of this, of course. Therefore, I am forced to consider Emacs Lisp as an alternative.  


But what are the real needs?

  1. Availability of a REPL: well, Emacs not only has a REPL (M-x ielm), but any Emacs lisp code in any buffer can be evaluated on the fly (C-j on elisp buffers or C-x C-e in any other buffer). Actually, you can see Emacs, not as an editor, but as some sort of Smalltalk-like environment for Lisp which gets to be enriched and modified on the fly. So this need is more than fulfilled.
  2. Easy access to OTB: well, there are no bindings for Emacs lisp, but given the fact that the OTB applications are available on the command line, they can be easily wrapped with system calls which can even be asynchronous. The use of the querying facilities of the OTB application engine, should make easy this wrapping. So the solution is not so straightforward as using the Java bindings on Clojure, but seems feasible without a huge effort.
  3. Concurrency: this one hurts, since Emacs lisp does not have threads, so I don't think it is possible to do concurrency at the Lisp level. However, since much of the heavy processing is done at the OTB level, OTB multi-threading capabilities allow to cope with this.

For those who think that Emacs Lisp is only useful for text manipulation, a look at the Common Lisp extensions and to the object oriented CLOS-compatible library EIEIO may be useful.  


So why not Emacs lisp then?

The main problem I see for the use of Emacs Lisp is that I have been unable to find examples of relatively large programs using it for other things than extending Emacs. There are of course huge tools like org-mode and Gnus, but they have been designed to run interactively inside Emacs. There are other more technical limitations of Emacs Lisp which can make the development of large applications difficult, as for instance the lack of namespaces, and other little things like that. However, there are very cool things that could be done to make OTB development easier using Emacs. For instance, one thing which can be tedious when testing OTB applications on the command line is setting the appropriate parameters or even remembering the list of the available ones. It would nice to have an Emacs mode able to query the application registry and auto-complete or suggest the available options. This could consist of a set of Emacs Lisp functions using the comint mode to interact with the shell and the OTB application framework. This would also allow to parse the documentation strings for the different applications and present them to the user in a different buffer. Once we have that, it should be possible to build a set of macros to set-up a pipeline of applications and generate an Emacs Lisp source file to be run later on (within another instance of Emacs or as an asynchronous process). This would build a development environment for application users as opposed to a development environment for application developers. And since Lisp code is data, the generates Emacs Lisp code, could be easily read again and used as a staring point for other programs, but better than that, a graphical representation of the pipeline could be generated (think of Tikz or GraphViz). Yes I know, talk is cheap, show me the code, etc. I agree.  



Clojure has a thriving community and new libraries appear every week to address very difficult problems (core.logic, mimir, etc.). Added to that, all the Java libraries out there are at your fingertips. And this is very difficult to beat. I hear the criticisms about Clojure: it's just a hype, it's mainly promoted by a company, etc. Yes, I would like to be really cool and use Scheme or Haskell, but pragmatism counts. And, anyway, the programming language you are thinking of is not good enough either.  


Tags: emacs programming programming-languages
17 Nov 2011

Hack OTB on the Clojure REPL

The WHY and the WHAT

Lately, I have been playing with several configurations in order to be able to fully exploit the ORFEO Toolbox library from Clojure.

As explained in previous posts, I am using OTB through the Java bindings. Recently, the folks of the OTB team have made very cool things – the application wrappers – which enrich even further the ways of using OTB from third party applications. The idea of the application wrappers is that when you code a processing chain, you may want to use it on the command line, with a GUI or even as a class within another application.

What Julien and the others have done is to build an application engine which given a pipeline and its description (inputs, outputs and parameters) is able to generate a CLI interface and a Qt-based GUI interface. But once you are there, there is no major problem (well, if your name is Julien, at least) to generate a SWIG interface which will then allow to build bindings for the different languages supported by SWIG. This may seem a kitchen sink – usine à gaz in French – and actually it is if you don't pay attention (like me).

The application wrappers use advanced techniques which require a recent version of GCC. At least recent from a Debian point of view (GCC 4.5 or later). On the other hand, the bindings for the library (the OTB-Wrapping package) use CableSwig, which in turn uses its own SWIG interfaces. So if you want to be able to combine the application wrappers with the library bindings, odds are that you may end up with a little mess in terms of compiler versions, etc.

And this is exactly what happened to me when I had to build GCC from source in Debian Squeeze (no GCC 4.5 in the backports yet). So I started making some experiments with virtual machines in order to understand what was happening. Eventually I found that the simplest configuration was the best to sort things out. This is the HOW-TO for having a working installation of OTB with application wrappers, OTB-Wrapping and access everything from a Clojure REPL.  


Installing Arch Linux on a virtual machine

I have chosen Arch Linux because with few steps you can have a minimal Linux install. Just grab the net-install ISO and create a new virtual machine on VirtualBox. In terms of packages just get the minimal set (the default selection, don't add anything else). After the install, log in as root, and perform a system update with:

    pacman -Suy

If you used the net-install image, there should be no need for this step, but in case you used an ISO with the full system, it's better to make sure that your system is up to date. Then, create a user account and set up a password if you want to:

    useradd -m jordi
    passwd jordi

Again, this is not mandatory, but things are cleaner this way, and you may want to keep on using this virtual machine. Don't forget to add your newly created user to the sudoers file:

    pacman -S sudo
    vi /etc/sudoers

You can now log out and log in with the user name you created. Oh, yes, I forgot to point out that you don't have Xorg, or Gnome or anything graphical. Even the mouse is useless, but since you are a real hacker, you don't mind.  


Install the packages you need

Now the fun starts. You are going to keep things minimal and only install the things you really need.

    sudo pacman -S mercurial make cmake gcc gdal openjdk6 \
                   mesa swig cvs bison links unzip rlwrap

Yes, I know, Mesa (OpenGL) is useless without X, but we want to check that everything builds OK and OTB uses OpenGL for displaying images.

  • Mercurial is needed to get the OTB sources, cvs is needed for getting CableSwig
  • Make, cmake and gcc are the tools for building from sources
  • Swig is needed for the binding generation, and bison is needed by CableSwig
  • OpenJDK is our Java version of choice
  • Links will be used to grab the Clojure distro, unzip to extract the corresponding jar file and rlwrap is used by the Clojure REPL.

And that's all!  


Get the source code for OTB and friends

I prefer using the development version of OTB, so I can complain when things don't work.

    hg clone
    hg clone

CableSwig is only available through CVS.

    cvs -d \
            co CableSwig



Start building everything

We will compile everything on a separate directory:

    mkdir builds
    cd builds/


OTB and OTB-Wrapping

We create a directory for the OTB build and configure with CMake

    mkdir OTB
    cd OTB
    ccmake ../../OTB

Don't forget to set to ON the application build and the Java wrappers. Then just make (literally):


By the way, I have noticed that the compilation of the application engine can blow up your gcc if you don't allocate enough RAM for your virtual machine. At this point, you should be able to use the Java application wrappers. But we want also the library bindings so we gon on. We can now build CableSwig which will be needed by OTB-Wrapping. Same procedure as before:

    cd ../
    mkdir CableSwig
    cd CableSwig/
    ccmake ../../CableSwig/

And now, OTB-Wrapping. Same story:

    cd ../
    mkdir OTB-Wrapping
    cd OTB-Wrapping/
    ccmake ../../OTB-Wrapping/

In the cmake configuration, I choose only to build Java, but even in this case a Python interpreter is needed. I think that CableSwig needs it to generate XML code to represent the C++ class hierarchies. If you did not install Python explicitly in Arch, you will have by default a 2.7 version. This is OK. If you decided to install Python with pacman, you will have both, Python 2.7 and 3.2 and the default Python executable will point to the latter. In this case, don't forget set the PYTHON\EXECUTABLE in CMake to /usr/bin/python2. Then, just make and cd to your home directory.


And you are done. Well not really. Right now, you can do Java, but what's the point? You might as well use the C++ version, right?  



Land of Lisp

Since Lisp is the best language out there, and OTB is the best remote sensing image processing software (no reference here, but trust me), we'll do OTB in Lisp.  


You may want to get some examples that I have gathered on Bitbucket.

    mkdir Dev
    cd Dev/
    hg clone

PWOB stands for Playing With OTB Bindings. I have only put there 3 languages which run on the JVM for reasons I stated in a previous post. You will of course avoid the plain Java ones. I have mixed feelings about Scala. I definetly love Clojure since it is a Lisp.  


Get Clojure

The cool thing about this Lisp implementation is that it is contained into a jar file. You can get it with the text-based web browser links:


Go to the download page and grab the latest release. It is a zip file which contains, among other things the needed jar. You can unzip the file:

    mkdir src
    mv src/
    cd src/

Copy the jar file to the user .clojure dir:

    cd clojure-1.3.0
    mkdir ~/.clojure
    mv clojure-1.3.0.jar ~/.clojure/

Make a sym link so we have a clojure.jar:

    ln -s /home/inglada/.clojure/clojure-1.3.0.jar /home/inglada/.clojure/clojure.jar

And clean up useless things

    cd ..
    rm -rf clojure-1.3.0*



Final steps before hacking

Some final minor steps are needed before the fun starts. You may want to create a file for the REPL to store completions:

    touch ~/.clj_completions

In order for Clojure to find all the jars and shared libraries, you have to define some environment variables. You may choose to set them into your .bashrc file:

    export LD_LIBRARY_PATH="~/builds/OTB-Wrapping/lib/:~/builds/OTB/bin/"
    export ITK_AUTOLOAD_PATH="~/builds/OTB/bin/"

PWOB provides a script to run a Clojure REPL with everything set up:

    cd ~/Dev/pwob/Clojure/src/Pwob

Now you should see something like this:

    Clojure 1.3.0

Welcome to Lisp! If you want to use the applications you can for instance do:

    (import '(org.otb.application Registry))
    (def available-applications (Registry/GetAvailableApplications))

In the PWOB source tree you will find other examples.  




Final remarks

Using the REPL is fun, but you will soon need to store the lines of code, test things, debug, etc. In this case, the best choice is to use SLIME, the Superior Lisp Interaction Mode for Emacs. There are many tutorials on the net on how to set it up for Clojure. Search for it using DuckDuckGo. In the PWOB tree (classpath.clj) you will find some hints on how to set it up for OTB and Clojure. A simpler config for Emacs is to use the inferior lisp mode, for which I have also written a config (otb-clojure-config.el). I may write a post some day about that. Have fun!

Tags: arch-linux clojure otb programming
20 May 2011

Lambda Omega Lambda

In my current research work, I am investigating the possibility of integrating domain expert knowledge together with image processing techniques (segmentation, classification) in order to provide accurate land cover maps from remote sensing image time series. When we look at this in terms of tools (software) there are a set of requirements which have to be met in order to be able to go from toy problems to operational processing chains. These operational chains are needed by scientists for carbon and water cycle studies, climate change analysis, etc. In the coming years, the image data needed for this kind of near-real-time mapping will be available through Earth observation missions like ESA's Sentinel program. Briefly, the constraints that the software tools for this task have are the following.

  1. Interactive environment for exploratory analysis
  2. Concurrent/parallel processing
  3. Availability of state of the art, validated image processing algorithms
  4. Symbolic, semantic information processing

This long post / short article describes some of the thoughts I have and some of the conclusions I have come to after a rather long analysis of the tools available.

Need for a repl

The repl acronym stands for Read-Evaluate-Print-Loop and I think has its origins in the first interactive Lisp systems. Nowadays, we call this an interactive interpreter. People used to numerical and/or scientific computing will think about Matlab, IDL, Octave, Scilab and what not. Computer scientists will see here the Python/Ruby/Tcl/etc. interpreters. Many people use nowadays Pyhton+Numpy or Scipy (now gathered in Pylab) in order to have something similar to Matlab/IDL/Scilab/Octave, but with a general purpose programming language. What a repl allows is to interactively explore the problem at hand without going through the cycle of writing the code, compiling, running etc. The simple script that I wrote for the OTB blog using OTB's Python bindings could be typed on the interpreter and bring together OTB and Pylab.  


Need for concurrent programming

Without going into the details and differences between parallel and concurrent programming, it seems clear that Moore's law can only continue to hold through multi-core architectures. In terms of low-level (pixel-wise) processing, OTB provides an interesting solution (multi-threaded execution of filters by dividing images into chunks). This approach can be generalized to GPUs. However, sometimes an algorithm needs to operate on the whole image because the image splitting affects the results. This is typically the case for Markovian approaches to filtering or methods for image segmentation. For this cases, one way to speed up things is to process several images in parallel (if the memory footprint allows that!). On way of trying to maximize the use of all available computing cores in Python is using the multiprocessing module which allows to deal with a pool of threads. One example would be as follows: However, this does not allow for easy inter-thread communication which is not needed in the above example, but can be very useful if the different processes are working on the same image: imagine a multi-agent system where classifiers, algorithms for biophysical parameter extraction, data assimilation techniques, etc. work together to produce an accurate land cover map. They may want to communicate in order to share information. As far as I understand, Python has some limitations due to the global interpreter lock. Some languages as for instance Erlang offer appropriate concurrency primitives for this. I have played a little bit with them in my exploration of 7 languages in 7 weeks. Unfortunately, there are no OTB bindings for Erlang. Scala has copied the Erlang actors, but I didn't really got into the Scala thing.



Need for OTB access


This one here shouldn't need much explanation. Efficient remote sensing image processing needs OTB. Period. I am sorry. I am rather biased on that! I like C++ and I have no problem in using it. But there is no repl, one needs several lines of typedef before being able to use anything. This is the price to pay in order to have a good static checking of the types before running the problem. And it's damned fast! We have Python bindings which allows us to have a clean syntax, like in the pipeline example of the OTB tutorials. However, the lack of easy concurrency is a bad point for Python. Also, the lack of Artificial Intelligence frameworks for Python is an anti-feature. Java has them, but Java has no repl and look at its syntax. It's worse than C++. You have all these mangled names which were clean in C++ and Python and become things like otbImageFileReaderIUS2. Scala, thanks to its interoperability with Java (Scala runs in the JVM), can use OTB bindings. Actually, we have a cleaner syntax than Java's: There is still the problem of the mangled names, but with some pattern matching or case classes, this should disappear. So Scala seems a good candidate. It has:

  1. Python-like syntax (although statically typed)
  2. Concurrency primitives
  3. A repl

Unfortunately, Scala is not a Lisp. Bear with me.  


A Lisp would be nice

I want to build an expert system, it would be nice to have something for remote sensing similar to Wolfram Alpha. We could call it Ω Toolbox and keep the OTB name (or close). Why Lisp? Well I am not able to explain that here, but you can read P. Norvig's or P. Graham's essays on the topic. If you have a look at books like PAIP or AIMA, or systems like LISA, CLIPS or JESS, they are either written in Lisp or the offer a Lisp-like DSLs. I am aware of implementations of the AIMA code in Python, and even P. Norvig himself has reasons to have migrated from Lisp to Python, but as stated above, Python seems to be out of the game for me. The code is data philosophy of Lisp is, as far as I understand it, together with the repl tool, one of the main assets for AI programming. Another aspect which is also important is the functional programming paradigm used in Lisp (even though other programming paradigms are also available in Lisp). Concurrency is the main reason for the upheaval of functional languages in recent years (Haskell, for instance). Even though I (still) don't see the need for pure functional programming for my applications, lambda calculus is elegant and interesting. Maybe λ Toolbox should be a more appropriate name?  



If we recap the needs:

  1. A repl (no C++, no Java)
  2. Concurrency (no Python)
  3. OTB bindings available (no Erlang, no Haskell, no Ruby)
  4. Lisp (none of the above)

there is one single candidate: Clojure. Clojure is a Lisp dialect which runs on the JVM and has nice concurrency features like inmutability and STM and agents. And by the way, OTB bindings work like a charm: Admittedly, the syntax is less beautiful than Python's or Scala's, but (because!) it's a Lisp. And it's a better Java than Java. So you have all the Java libs available, and even Clojure specific repositories like Clojars. A particular interesting project is Incanter which provides is a Clojure-based, R-like platform for statistical computing and graphics. Have a look at this presentation to get an overview of what you can do at the Clojure repl with that. If we bear in mind that in Lisp code is data and that Lisp macros are mega-powerful, one could imagine writing things like:

    (make-otb-pipeline reader gradientFilter thresholdFilter writer)

Or even emulating the C++ template syntax to avoid using the mangled names of the OTB classes in the Java bindings (using macros and keywords):

    (def filter (RescaleIntensityImageFilter :itk (Image. : otb :Float 2)
                                                  (Image. : otb :UnsignedChar 2)))

instead of

    (def filter (itkRescaleIntensityImageFilterIF2IUC2.))

I have already found a cool syntax for using the Setters and Getters of a filter using the doto macro (see line 19 in the example below):


I am going to push further the investigation of the use of Clojure because is seems to fit my needs:

  1. Has an interactive interpreter
  2. Access to OTB (through the Java bindings)
  3. Concurrency primitives (agents, STM, etc.)
  4. It's a lisp, so I can easily port existing rule-based expert systems.

Given the fact that this would be the sum of many cool features, I think I should call it Σ Toolbox, but I don't like the name. The mix of λ calculus and latex Ω Toolbox, should be called λΩλ, which is LoL in Greek.  


Tags: clojure lisp otb programming
19 Apr 2011

Functional Python

Small things can make you smile when you have the Aha! moment. It seems that these few days of 7li7w are starting to have real effects in the way I think when programming. Today, I was faced to a simple but annoying problem. I had about 60 Formosat 2 acquisitions that I needed to use. These images have been pre-processed (thank Olivier!) and, because of historical reasons they are presented like this:


That is, a single ENVI header file and the 4 bands in one file each. And the same thing for every single acquisition date. I wanted to have the list of all the acquisition dates (the 20060206 above). Since this was not a one shot operation, I wanted something a little bit more generic than opening the directory containing the images in an Emacs Dired buffer and copying and pasting the dates, so I have used Python. The thing is solved in a single line of code:

    dates = [fil.split('_')[1] for fil in os.listdir(imageDir) 

I use a list comprehension, where for each file in the imageDir directory (os.listdir), if the extension is hdr I keep the second chunk of the name of the file when _ is used as a separator.

Tags: programming python
16 Apr 2011

7 languages in 7 weeks

Some months ago I found a book of the Pragmatic Bookshelf which seemed intriguing to me: Seven Languages in Seven Weeks: A Pragmatic Guide to Learning Programming Languages by Bruce A. Tate.

I say that it was intriguing, because I had read not long before a very interesting essay by Peter Norvig entitled Teach Yourself Programming in Ten Years were he is very critical of the current tendency to be in a rush for learning any new skill and mostly for computer programming languages.

Since I was very much persuaded by P. Norvig's arguments, the 7li7w book upset me a little bit: I am a big fan of the Pragmatic Programmers and I couldn't believe that they were going this road! But I finally picked the book and went through quickly and found it very appealing. B. Tate, the author is very clear from the beginning: you can not pretend to learn all those languages in a week each.

The goal of the book is to present the things that make each one of these languages interesting (the paradigms, the typing, etc.) and make the reader feel the strengths and weaknesses of each one of them. Unfortunately, I didn't go further on the exploration and I forgot about the book. Several weeks ago, I found the book unexpectedly looking at me and I decided to give it a try.

The first chapter about Ruby went smoothly. I have been using Python since early 2000 and last year I played a little bit with Squeak, a version of Smalltalk. I find Ruby to be somewhere in the middle of these two languages.

When I started the chapter on Io, I realized that I had to stretch my neurons a little bit more. I started to fear that the motivation could disappear soon, so I asked for help. I e-mailed Emmanuel Christophe (a former work colleague with whom we have sometimes fun together programming and talking about crazy ideas) in order to ask him to monitor my work on the book.

As I should have expected, he proposed to join me since he also had started to read the book and never finished. So we did Io, Prolog and we are now on Scala. The source code of our project is hosted on bitbucket and available for anyone who may be interested.

Tags: books programming
26 Feb 2011

HDF4 support for GDAL on Arch Linux

I have been trouble reading HDF files with OTB on Arch Linux for a while. I finally took the time to investigate this problem and come to a solution. At the beginning I was misled by the fact that I was able to open a HDF5 file with Monteverdi on Arch. But I finally understood that the GDAL pre-compiled package for Arch was only missing HDF4 support. It is due to the fact that HDF4 depends on libjpeg version 6, and is incompatible with the standard current version of libjpeg on Arch. So the solution is to install libjpeg6 and HDF4 from the AUR and then regenerate the gdal package, who, during the configuration phase will automatically add HDF4 support. Here are the detailed steps I took:

Install libjpeg6 from AUR:

mkdir ~/local/packages
cd ~/local/packages
tar xzf libjpeg6.tar.gz
cd libjpeg6
makepkg -i #will prompt for root psswd for installation

Install HDF4 from AUR:

cd ~/local/packages
wget [[]]
tar xzf hdf4-nonetcdf.tar.gz
cd hdf4-nonetcdf
makepkg -i #will prompt for root psswd for installation

Setup an Arch Build System build tree

sudo pacman -S abs
sudo abs
mkdir ~/local/abs

Compile gdal

cp -r /var/abs/community/gdal ~/local/abs
makepkg -s #generated the package without installing it
sudo pacman -U gdal-1.8.0-2-x86_{64}.pkg.tar.gz

If you are new to using AUR and makepkg, please note that the AUR package installation need the sudo package to be installed first (the packages are build as a non root user and sudo is called by makepkg). Step 3 above is only needed if you have never set up the Arch Build System on your system.

Tags: programming remote-sensing
Other posts
Creative Commons License by Jordi Inglada is licensed under a Creative Commons Attribution-ShareAlike 4.0 Unported License. RSS. Feedback: info /at/ Mastodon