Weekend Reading Round-Up

Hot off the press! Articles on deep learning, early-warning signs of critical transitions, recommendation engines, support vector machines, and music.

read on...

Weekend Reading Round-Up

How easy is it to deceive a deep neural network? Does gender of leaders affect team cohesion? Can music be classified by looking at entropy alone?

read on...

Weekend Reading Round-Up

In this month’s reading round-up I look at businesses in Africa, category theory and Scala, fancy copy-pasting of code, neuromorphic microchips, machine learning, philosophy, supercomputers, topology, and of course data.

read on...

Weekend Reading Round-Up

Below I have collected an initial batch of recent research articles and posts on various topics, such as deep learning, graphs, music, and Scala, that may be of interest to readers of Databaseline.

read on...

Blockchains and Music

There are two crucial pieces of information for each song: who performed it (performing rights) and who owns the rights to it (copyrights). As of today, there is no central system that maintains this information. Is this really such a problem? Well, Spotify had to settle a $30m lawsuit a while ago because they had no idea whom to pay. There is also an infamous case in which 107% of the rights to a single song were sold. So, yes, it’s a pretty big deal. That’s why I want to take a look at blockchains as a possible solution.

read on...

How Do Musicians Make Money? (Part 2/2)

2016 was a good year for the music business: revenue from streaming and digital downloads exceeded that of physical sales for the first time in the history of the industry. On the one hand, companies such as Spotify and Apple have made digital music easily accessible and legal. On the other hand, each stream only pays artists fractions of a cent, which means that all but the most popular artists make very little money from streaming. Fair compensation for artists in a digital world is a recurring topic in media and industry. In this article I want to look at the ways professional musicians make money now.

read on...

How Do Musicians Make Money? (Part 1/2)

For most of human history, music has been composed and performed by amateurs, enslaved individuals, and professionals in direct employment of the nobility. Music as a full-time career option without some form servitude has only been a fairly recent phenomenon. Before we shall dive into the complexities of the music industry and how musicians make money now, let’s go back in time and see how we got here.

read on...

Shell Scripts to Check Data Integrity in Hive

Apache Hive does not come with an out-of-the-box way to check tables for duplicate entries or a ready-made method to inspect column contents, such as for instance R’s summary function. In this post I shall show a shell scripts replete with functions to do exactly that: count duplicates and show basic column statistics. These two functions are ideal when you want to perform a quick sanity check on the data stored in or accessible with Hive.

read on...

A Quickie on Collecting Table and Column Statistics in MemSQL

MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not automatically collect table (i.e. all columns) and column (i.e. range) statistics. These statistics are important to the query optimizer. Here I’ll present a lightweight shell script that collects table and column (i.e. range) statistics based on a configuration file.

read on...

How to Stay Up to Date with Trends in Tech

As technology evolves at a rapid rate, it may sometimes be difficult to keep up. While I am certainly not the world’s leading expert on anything, I thought I’d share how I keep abreast of the latest developments in the industry.

read on...

A Glossary of Some Software Terminology

Continuous integration, Docker, Jenkins, Vagrant, DevOps, PaaS, serverless… If you are sometimes confused as to what the latest buzzwords in the tech industry mean, you have come to the right place

read on...

The Problems with Visual Programming Languages in Data Engineering

Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.

read on...

Shell Scripts to Ease Spark Application Development

When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is the fully qualified class. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.

read on...

An Overview of Apache Streaming Technologies

There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering amount of options with sometimes few major differences that are not well documented or easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, GearpumpApex, Kafka StreamsSpark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.

read on...

ETL: A Simple Package to Load Data from Views

A common, native way to load data into tables in Oracle is to create a view to load the data from. Depending on how the view is built, you can either refresh (i.e. overwrite) the data in a table or append fresh data to the table. Here, I present a simple package ETL that only requires you to maintain a configuration table and obviously the source views (or tables) and target tables.

read on...

The Case for Industrial Data Science

It has – perhaps somewhat prematurely – been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay.

The available literature, the majority of courses in both the virtual and real world, and the media all purport the image of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day.

The reality for many in the field is quite different. Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity. Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world, and it’s time to make the case for industrial data science.

read on...

Setting up Scala for Spark App Development

Apache Spark is a popular framework for distributed computing, both within and without the Hadoop ecosystem. Spark offers interactive shells for Scala as well as Python. Applications can be written in any language for which there is an API: Scala, Python, Java, or R. Since it can be daunting to set up your environment to begin developing applications, I have created a presentation that gets you up and running with Spark, Scala, sbt, and ScalaTest in (almost) no time.

read on...

Mapping a Value Stream in Neo4j

The canonical use cases of a graph database such as Neo4j are social networks. In logistics and manufacturing networks also arise naturally. In particular, supply chains and value streams spring to mind. They may not be as large as Facebook’s social graph of all its users, but seeing them for the beasts they truly are can be beneficial. In this post I therefore want to talk about how you can model a value stream in Neo4j and how you can extract valuable information from it.

read on...

A Quickie on Connecting to Oracle Database VM From A Host

Although the pre-built Oracle Database 12c VMs come with Oracle SQL Developer and APEX, you may not want to leave the host environment and develop in the virtual machine (guest). Sure, you can set up a shared folder and enable bi-directional copy-paste functionality thanks to the so-called Guest Additions, but it’s not the same as working in your own host OS.

In this post I describe how you can connect from the host to the guest on which the VM resides with a few simple tweaks. I have also included a simple installation overview of SQL Developer for Ubuntu.

read on...

Checking Data Type Consistency in Oracle

In large databases it can be a challenge to have data type consistency across many tables and views, especially since SQL does not understand PL/SQL’s %TYPE attribute. When designing the overall structure of the tables, tools such as SQL Developer’s Data Modeller can be used to reduce the pain associated with potential data type inconsistencies. However, as databases grow and evolve, data types may diverge and cause headaches when moving data back and forth. Here I present a utility to identify and automatically fix many of these issues.

read on...

An Overview of PL/SQL Collection Types

Collections are core components in Oracle PL/SQL programs. You can (temporarily) store data from the database or local variables in collections and pass these collections to subprograms. Collections are also critical to bulk operations, such as BULK COLLECT and FORALL, as well as table functions, both simple and pipelined. Bulk operations and table functions are critical to high-performance code.

Here, I provide an overview of their characteristics as an introduction to novice PL/SQL developers and as an one-stop reference.

read on...

Why Govern Your Data?

The way a company looks at its data is indicative of its readiness to embrace a data governance programme: is data a by-product of doing business or an asset that requires attention and resources? One of the key questions with data governance is, ‘Why?’

Why should you govern your data? What’s the benefit?

read on...

Oracle SQL and PL/SQL Coding Guidelines

Coding standards are important because they reduce the cost of maintenance. To enable database developers on the same team to read one another’s code more easily, and to have consistency in the code produced and to be maintained, I have prepared a set of coding conventions for Oracle SQL and PL/SQL. These are by no means the be-all and end-all of Oracle Database standards, and in some instances you may not agree with the conventions I have proposed. That’s why I have created an easy-to-share, easy-to-edit Markdown document with these guidelines, including a snazzy CSS3 style sheet, in my Bitbucket repository. You can adapt these guidelines for your organization’s needs as you see fit; an attribution would be grand but I won’t sue you if you’re dishonest.

read on...

Unit Testing PL/SQL Code?

In almost all areas of software development, unit testing is not only common sense but also common practice. After all, hardly any serious software vendor would dare ship applications without having properly tested their functionality. When it comes to databases, many organizations still live in the Dark Ages. With Oracle SQL Developer there is absolutely no reason to remain in the dark: unit testing PL/SQL components is easy, free, and fully integrated into the IDE.

read on...

Tuning Distributed Queries in Oracle

When it comes to SQL statements and optimizing queries on relational databases, probably the first thing developers (ought to) look at is the execution plan. The execution plan shows you what the database engine thinks is the best way to execute a query and it gives estimates of relevant runtime indicators that influenced the optimizer’s decision.

When a query involves calls to remote databases you may not always get the best execution (plan) available, because Oracle always runs the query on the local database as it has no way of estimating the cost of network traffic and thus no way of weighing the pros and cons of running your query remotely versus locally. Many tips and tricks have been noted by gurus and of course Oracle, but I was recently asked to tune a query than involved more than the textbook cases typically shown online.

read on...

Frustration: The Data Governance Sinkhole

Plentiful are the companies that revel in fancy descriptions of data-driven decision-making cultures on their corporate websites. Scarce are they who actually have a data governance office to back up these grand claims, for any data-centred programme without clear definitions and business processes regarding data is doomed to fail.

What is less known about data governance is that there is a phase during which companies run a risk of losing their best and brightest because of inaction or even worse: wrong actions. Michael Lopp has written an excellent article on why bored engineers quit, and as bad as that may be in general situations it is disastrous in the early phases of a data governance programme.

read on...

Searching The Oracle Data Dictionary

Databases and especially data warehouses typically consist of many dozens of tables and views. Good documentation is essential but even the best documentation cannot answer your questions as quickly as you want the information.

read on...

Numerical Algorithms: Variational Integrators

When I was rummaging through my digital attic I found code that I had worked on a few years ago. It is not related to data or database technologies but it is in itself interesting, so that I thought I’d better share it.

To calculate integrals in mathematics or solve differential equations in physics or chemistry you often need a computer’s help. Only in very rare cases can you express the integral symbolically. In most instances you have to be satisfied with a numerical solution, which for many problems is perfectly fine.

read on...

How to Multiply Across a Hierarchy in Oracle: Performance (Part 2/2)

In the first part I have shown several options to calculate aggregation functions along various branches of a hierarchy, where the values of a particular row depend on the values of its predecessors, specifically the current row’s value is the product of the values of its predecessors.

Performance was the elephant in the room. So, it’s time to take a look at the relative performance of the various alternatives presented.

read on...

How to Multiply Across a Hierarchy in Oracle: SQL Statements (Part 1/2)

Hierarchical and recursive queries pop up from time to time in the life of a database developer. Complex hierarchies are often best handled by databases that are dedicated to such structures, such as Neo4j, a popular graph database. Most relational database management systems can generally deal with reasonable amounts of hierarchical data too. The classical example of hierarchical queries in SQL is the employees table: construct the organization chart with the CEO sitting at the top of the tree and all employees dangling from the branches on which their respective managers have placed their bottoms. While a direct acyclic graph of all company talent may tickle your fancy, I very much doubt it.

read on...

Homelessness: A Problem with Data but Few Solutions

A while ago I had the pleasure to visit Strasbourg, the capital of the Alsace region and home of the European parliament. Strasbourg is a lovely city to spend a few days, go shopping, enjoy French and Alsatian cuisine, and take in the scenery, most notably the picturesque Petite France area with its half-timbered houses and restaurants serving local delicacies along the river Ill.

The story I want to tell has very little to do with Strasbourg though. It is about homelessness, statistics on homelessness, and more to the point: the lack of action by our governments to solve the problem of homelessness.

read on...

The Data-Quality Doctor

When it comes to data quality, many companies are still stuck in firefighting mode: whenever a problem arises, a quick fix is introduced, often without proper documentation, after which the problem is filed under ‘H’ for “hopefully not my problem the next time around”. The patches are often temporary, haphazard, and cause more problems downstream than they really solve. In most instances there is no obvious reason for it, which makes a bad situation worse. It reminds me of the saying that the treatment is worse than the disease, so it’s time to bring in the doctor…

read on...

The Football Fortune Tellers

It was of course a public relations gimmick by Goldman Sachs and PricewaterhouseCoopers to ‘predict’ the outcome the 2014 FIFA World Cup. Their best and brightest looked into their crystal footballs at historical data and created statistical models that foresaw what has recently transpired.

Let’s take a look at what they claimed before the World Cup and what really happened in Brazil in order to answer the question: Were they right?

read on...

The Challenges of Data Integration: Data Governance (Part 4/4)

The fourth and final part of the series on the challenges of data integration is about data governance. Data governance is not so much a challenge as it is a critical component of continued success. Data integration is usually only the Band-Aid that is applied to a particular business problem. Beyond that it has the power to transform a business, but to do so you need to continuously guard, monitor, analyse, and improve your data and related business processes, so that the information you glean from it is always sound.

read on...

The Challenges of Data Integration: People (Part 3/4)

The third part of the four-part series on the challenges of data integration deals with people. I have already hinted at a few people issues in the first and second parts on technical and project management challenges, respectively, but I have not gone into specifics.

Where people work together there will be conflicts. As we shall see, data integration projects can be particularly tricky, as they require dirty data to be ‘smuggled’ over organizational borders into enemy territory.

read on...

The Challenges of Data Integration: Project Management (Part 2/4)

In this second post of a four-part series on the challenges of data integration I want to talk about project management. Data integration is, as I have said before, not simply a matter of throwing technical people at a business problem. Not literally of course: most people do not like being flung at things, abstract or concrete, but probably at the latter a bit less than at the former.

Project management is the key to your success. Sure, you need able people to build the data warehouse, but without a solid foundation in project management your project will tip over at the slightest sigh. And please take it from someone who has been there, done that, got the T-shirt, and has outgrown it: there will be a lot of sighs during the project, even full-blown tornadoes… To weather any storm, you and the entire organization have to live project management practices. Project management is not the silver bullet, but it can protect you against the most common enemies: no idea, no plan, no back-up plan, and no support.

read on...

The Challenges of Data Integration: Technical (Part 1/4)

Data integration is a formidable challenge. For one, data integration is never the goal of an organization. Similarly, a data warehouse is never the objective. It is merely a vehicle that can drive you to your destination. Data storage and integration for data’s sake are a waste of time, money, resources, and nerves. Without a clear business case, effective leadership, and strong support, including but not limited to a highly visible and respected sponsor, any data integration project is doomed from the get-go.

read on...