Reading A note about open source vs proprietary tools

Last updated: January 4, 2023

With the growing popularity of machine learning, many frameworks have appeared in various languages. One of the questions you might be facing is: which tool should I choose?

Points worth considering

There are a several points you might want to consider in making that choice. For instance, what tools do people use in your field? (what tools are used in the papers you read?) What tools are your colleagues and collaborators using?

Looking a bit further into the future, whether you are considering staying in academia or working in industry could also influence your choice. If the former, paying attention to the literature is key, if the latter, it may be good to have a look at the trends in job postings.

Some frameworks offer collections of already-made toolkits. They are thus easy to start using and do not require a lot of programming experience. On the other hand, they may feel like black boxes and they are not very customizable. Scikit-learn and Keras—usually run on top of TensorFlow—fall in that category. Lower level tools allow you full control and tuning of your models, but can come with a steeper learning curve.

This course will focus on PyTorch, developed by Facebook’s AI Research lab, because of its exploding popularity in research, its ease of use while being extremely customizable, because it is really written for Python (so its syntax is easy to understand), and because it is free and open source.

The most popular machine learning library currently remains TensorFlow, developed by the Google Brain Team. While it has a Python API, its syntax can be more obscure.

Julia’s syntax is well suited for the implementation of mathematical models, GPU kernels can be written directly in Julia, and Julia’s speed is attractive in computation hungry fields. So Julia has also seen the development of many ML packages such as Flux or Knet. The user base of Julia remains quite small however.

My main motivation in writing this section however is to raise awareness about one question that should really be considered in the question of which tool to choose: whether the tool you decide to learn and use in your research is open source or proprietary.

Proprietary tools

As a student, it is tempting to have the following perspective:

"My university pays for this very expensive license. I have free access to this very expensive tool. It would be foolish not to make use of it while I can!"

When there are no equivalent or better open source tools, that might be true. But when superior open source tools exist, these university licenses are more of a trap than a gift.

Here are some of the reasons you should be wary of proprietary tools:

Researchers who do not have access to the tool cannot reproduce your methods

Large Canadian universities may offer a license for the tool, but grad students in other countries, independent researchers, researchers in small organizations, etc. may not have access to a free license (or tens of thousands of dollars to pay for it).

Once you graduate, you may not have access to the tool anymore

Once you leave your Canadian institution, you may become one of those researchers who do not have access to that tool. This means that you will not be able to re-run your thesis analyses, re-use your methods, and apply the skills you learnt. The time you spent learning that expensive tool you could play with for free may feel a lot less like a gift then.

Your university may stop paying for a license

As commercial tools fall behind successful open source ones, some universities may decide to stop purchasing a license. It happened during my years at SFU with an expensive and clunky citation manager which, after having been promoted for years by the library through countless workshops, got abandoned in favour of a much better free and open source one.

You may get locked-in

Proprietary tools often come with proprietary formats and, depending on the tool, it may be painful (or impossible) to convert your work to another format. When that happens, you are locked-in.

Proprietary tools are often black boxes

It is often impossible to see the source code of proprietary software.

Long-term access

It is often very difficult to have access to old versions of proprietary tools (and this can be necessary to reproduce old studies). When companies disappear, the tools they produced usually disappear with them. Open source tools, particularly those who have their code under version control in repositories such as GitHub, remain fully accessible (including all stages of development), and if they get abandoned, their maintenance can be taken over or restarted by others.

The licenses you have access to may be limiting and a cause of headache

For instance, Compute Canada does not have an unlimited number of MATLAB licenses. Since these licenses come with complex rules (one license needed for each node, additional licenses for each toolbox, additional licenses for newer tools, etc.), it can quickly become a nightmare to navigate through it all. You may want to have a look at some of the comments in this thread.

Proprietary tools fall behind popular open source tools

Even large teams of software engineers cannot compete against an active community of researchers developing open source tools. When open source tools become really popular, the number of users contributing to their development vastly outnumbers what any company can provide. The testing, licensing, and production of proprietary tools are also too slow to keep up with quickly evolving fields of research. (Of course, open source tools which do not take off and remain absolutely obscure do not see the benefit of a vast community.)

Proprietary tools often fail to address specialized edge cases needed in research

It is not commercially sound to develop cutting edge capabilities so specialized in a narrow subfield that they can only target a minuscule number of customers. But this is often what research needs. With open source tools, researchers can develop the capabilities that fit their very specific needs. So while commercial tools are good and reliable for large audiences, they are often not the best in research. This explains the success of R over tools such as SASS or Stata in the past decade.

Conclusion

All that said, sometimes you don't have a choice over the tool to use for your research as this may be dictated by the culture in your field or by your supervisor. But if you are free to choose and if superior or equal open source alternatives exist and are popular, do not fall in the trap of thinking that because your university and Compute Canada pay for a license, you should make use of it. It may be free for you—for now—but it can have hidden costs.

Comments & questions