The Problems with Visual Programming Languages in Data Engineering
Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.
Before I proceed I want to make it absolutely clear that I’m only talking about visual programming languages (VPLs) in data engineering. There are domains where I think VPLs are valuable, most notably education (e.g. teaching children to code), game development, multimedia (e.g. DSP and 3D graphics rendering), simulations, BPM workflows, and to some extent automation too.
The main gripes with visual tools in data engineering are:
- Data governance
These three problems belong to three separate tags: technology, organization, and ergonomics. Let’s take a look at each category, explore the pros and cons, and see which ones are fundamental to VPLs and which ones are not.
For more discussions and background information, please check out the following sources:
Anyone with a software development background who has ever dealt with visual ETL tools may have marvelled at the lack of proper version control and diff tools that go with it. Some tools come with their own built-in VCS, while others allow you to use any or no VCS at all. The difficulty lies in the fact that the visual representation is often stored as an XML (or JSON) file. So, if a box is moved by 1 pixel, the file is different. You could argue that it’s indeed different because the layout is different, but you could equally make the case that the logic has not changed. This argument is moot though: it is technically possible to ensure that the tool auto-aligns blocks and routes/colours arrows, very much like yEd does (via menu items). Some users may not be happy with the reduced control over the way the flow looks, but others may rejoice that version control has become usable.
ETL (and ORM) tools often auto-generate code that is not particularly tuned for the data source in question. I have encountered many odd nested loops where simple hash joins would have been more appropriate if only the predicates had been pushed down properly (and if only the tool had evaluated blocks lazily). Aggregations and timestamp-based filters are also often a cause for performance issues. Again, performance is technically solvable, so this may be a valid argument against visual tools in data engineering now but perhaps not tomorrow.
Grepping or find-and-replace operations are easy enough on text files, but for visual programming languages they are a pain. I believe that this is also not a fundamental issue with VPLs and it can (and may) be solved (some day). Bear in mind that text-based tools have been the norm, so they are obviously more mature.
Similarly, I think the following items are current issues but they are not intrinsic to VPLs in data engineering:
- Variable scope is often murky, especially non-constant globals;
- Looping (or element-by-element mapping) is impossible or tedious;
- What goes on ‘behind the scenes’ is often unclear or downright dodgy (e.g. removing an existing column a table in SQL Developer’s Data Modeller causes the column to be hidden instead of being dropped).
In data engineering, people deal with various systems, formats, protocols, frameworks, and so on. This means that integration into the overall architecture is essential. Many tools offer a plethora of connectors, but they are all a one-way street: from the various systems into the visual tool. Whatever goes into these tools is not accessible from the outside.
Some of these issues are the consequence of crummy version control, but most of these VPLs are designed as monolithic monsters, so they are anathema to modern enterprise architectures based on event sourcing (or CQRS) and microservices.
These tools very much try to be at the core of everything without playing according to the same rules: continuous integration, unit testing frameworks, functional testing, and basic debugging and logging functionality are either completely absent or unsatisfactory because they are an extreme pain to set up. Sure, these could be developed from scratch, but the people who use these tools are often not software developers, so the demand for a full DevOps suite is not exactly high on the list.
Visual tools are marketed as ‘non-programmer friendly’. In a way that is true, but it’s also a convenient lie. After all, you are generating code and the code is run against production systems. Proper design is still necessary. The complexity can easily get out of hand without proper modularization, but I shall refrain from talking about that until the next section.
Excel is also very non-programmer friendly, but it is probably the biggest scourge in modern enterprises as it completely opposes any data governance initiative. After all, why should anyone clean up the database when an Excel sheet already does that?
VPLs in data engineering follow the exact same path: the salespersons who peddle these tools target non-programmers but they fail to mention that well tested, properly documented, re-usable and maintainable visual code still requires some programming experience. With everyone and anyone creating custom flows, there is absolutely no guarantee that people are looking at the same data, just like with Excel.
Although I believe companies can counteract such behaviour I am not confident that many have the discipline to succeed, which is why I attribute the problem to organizations with visual tools and not merely organizations as such. A decision in favour of a visual data engineering tool is an implicit repudiation of data governance.
What I find most disturbing is that visual data engineering tools are extremely keyboard unfriendly. Most people have ten fingers and with some practice can type fairly fast. Pointing-and-clicking is tedious, may cause RSI, and only uses one hand.
Sure, a possible solution to this is to introduce keyboard shortcuts to access menus and manipulate objects on the canvas or have two pointing devices. Nevertheless, this poses a serious usability issue: except for a few power users most people won’t bother with complex shortcuts or the coordination required to move two pointing devices at the same time.
Under the heading ‘maintainability’ there are a few arguments that I feel are very much true but they depend on your background and they do not describe fundamental objections to VPLs in data engineering. I mention these points merely for the sake of completeness.
With visual tools it’s easy to end up with flows with many dozens of blocks and arrows. The complexity can technically be managed by properly modularizing components, but that often requires skills beyond that of business analysts or data scientists. After all, it’s not their task to create production-grade flows, is it?
This leads to another difficulty: just because business analysts and data scientists can create flows does not mean that they ought to. Their value lies in using the data, not preparing it. That’s exactly what data engineers are good at. However, they will argue that visual tools are not their forté and they may point to the three main problems I have already identified: integration, data governance, and ergonomics.
To make matters worse, the data engineers may end up having to support such a tool anyway because they have to ‘figure out’ why it doesn’t perform as well as it is ‘supposed’ to; scalability is rarely on anyone’s radar except data engineers’. At that point the organization is too heavily invested in a tool that’s only going to cost more and more in terms of maintenance.
With maintainability obviously also comes documentation. I have heard people argue that ‘data flows are self-documenting’. That is complete and utter nonsense. Note that this is not related to a person’s background: a programmer may have more immediate insights into a Gradle-based project than a collection of ETL flows, and a business analyst may claim exactly the opposite.
What I am talking about is this: anyone who thinks that a visual representation is crystal clear has only seen sales slides. Anyone with real-world experience will know that an ETL flow with 200 blocks and a similar number of arrows and multiple levels of nesting (and external function calls) is about is clear as mud. To know what’s inside a block you must click on it. Just because the properties tab shows you that there is a filter on a particular column does not mean that you understand why it’s there. This is of course true of ‘normal’ code too, so no excuses. The idea that VPLs are automatically self-documenting is preposterous.
Most visual tools allow comment boxes to be dragged anywhere onto the canvas. This is not a very practical solution. First, it requires you to be on the right canvas or within the right file to see the relevant documentation. Second, there is no way to extract it. Technically, it’s possible to compile it into an HTML page (à la Scaladoc) but it’s not clear what should be shown: just the comment box or some high-level flow as well? The former is taken out of context, so unlikely going to be helpful, whereas the latter must be aggregated in some way. Is that a separate flow or box to check in a menu?
I have mentioned modularity before, it comes back under the banner of maintainability as well. It’s not unsolvable, but I haven’t seen a good solution for it. Key to maintainable visual data flows is modular design. However, since most users are not software developers, they typically end up repeating the same logic over and over again. To add insult to injury, these tools allow custom code to be called at will (e.g. Python or R scripts, SQL, DLLs, or JARs), which means that a simple block may be either a link to a completely new flow or call a completely untested piece of code that has to be stored on some server. Let’s not even go into details about access control and security in such a situation.
I already mentioned performance, but I want to mention it again. Because these flows are often created by business analysts or data scientists there is the risk that production systems are suddenly hit with unusual amounts of traffic, especially if the auto-generated code leads to sub-optimal code (e.g. many tools wrap YEAR(), MONTH(), or DAY() functions around timestamp columns in filters, even though it is often more advantageous to use simple inequality predicates when a timestamp column is indexed). This is both an operational and maintainability risk: if you cannot modify the auto-generated code your live systems may be impacted negatively, and if you can modify the auto-generated code you have to maintain it, which may be a lot of ‘fun’ with software upgrades.
None of the comments in this section are in my opinion a fundamental problem of VPLs for data engineering. They are, however, a critique of the current state of most tools.
A 2D canvas onto which you can drag and drop objects is a very limited space. Moreover, once you discover that you need to add a block in-between two other blocks of a flow with a few hundred blocks, you may end up with a layout that looks haphazard. Manually rearranging these blocks is a true waste of time. An auto-layout feature may be fine, but it can also make the entire layout appear new, even to its creator.
The price of flexibility is a degradation in the user experience. Many visual tools have endless menu options, tabs with confusing options and check boxes, and features whose precedence is not always obvious. Development environments that are backed by tools such as Eclipse tend to have a bewildering amount of over-engineering in them. I do admit that I’m biased against Eclipse though.
Because the tool is visual, there is by definition no standalone mode. This is often the reason people go with VPLs, but GUIs are hardly known for their low memory footprint. Since most computers are equipped with a decent amount of RAM this is not the death knell for VPLs. It just means that on connections with limited bandwidth users may experience severe latency issues.
All in all, I think many commonly voiced critiques against VPLs in data engineering are based on background, experience, and preference. Many of the issues raised are about the current state of affairs and not indicative of inherent defects in visual tools. Obviously, if the current state-of-the-art VPLs don’t cut the mustard, why use one at all?
The problem is that management may warm up to the idea of a visual tool. After all, it’s easier to present a visual representation of code than a stack of punch cards, but it is by no means the only or best way to achieve well-crafted software, as clearly evinced in the following quotation from the 1970s:
Visual programming is like trying to get to the moon by climbing a tree. You make a lot of good early progress, at first, but pretty soon you have to go back to your funding agency and ask for a bigger tree.
Nevertheless, there are three problems that are intrinsic to visual tools in data engineer though:
- the near-impossibility of integration into non-monolithic enterprise architectures and the absence of established best practices (e.g. continuous integration),
- a natural resistance to data governance programmes due to the anyone-can-be-a-data-engineer(-until-the-shit-hits-the-fan) notion, and
- the promotion of potentially unhealthy habits due to extended mouse usage.