Hot off the press! Articles on deep learning, early-warning signs of critical transitions, recommendation engines, support vector machines, and music.
How easy is it to deceive a deep neural network? Does gender of leaders affect team cohesion? Can music be classified by looking at entropy alone?
In this month’s reading round-up I look at businesses in Africa, category theory and Scala, fancy copy-pasting of code, neuromorphic microchips, machine learning, philosophy, supercomputers, topology, and of course data.
Below I have collected an initial batch of recent research articles and posts on various topics, such as deep learning, graphs, music, and Scala, that may be of interest to readers of Databaseline.
There are two crucial pieces of information for each song: who performed it (performing rights) and who owns the rights to it (copyrights). As of today, there is no central system that maintains this information. Is this really such a problem? Well, Spotify had to settle a $30m lawsuit a while ago because they had no idea whom to pay. There is also an infamous case in which 107% of the rights to a single song were sold. So, yes, it’s a pretty big deal. That’s why I want to take a look at blockchains as a possible solution.
If you need to read a JSON file from a resources directory and have these contents available as a basic String or an RDD and/or even Dataset in Spark 2.1 (with Scala 2.11), you’ve come to the right place.
2016 was a good year for the music business: revenue from streaming and digital downloads exceeded that of physical sales for the first time in the history of the industry. On the one hand, companies such as Spotify and Apple have made digital music easily accessible and legal. On the other hand, each stream only pays artists fractions of a cent, which means that all but the most popular artists make very little money from streaming. Fair compensation for artists in a digital world is a recurring topic in media and industry. In this article I want to look at the ways professional musicians make money now.
For most of human history, music has been composed and performed by amateurs, enslaved individuals, and professionals in direct employment of the nobility. Music as a full-time career option without some form of servitude has only been a fairly recent phenomenon. Before we dive into the complexities of the music industry and how musicians make money now, let’s go back in time and see how we got here.
Apache Hive does not come with an out-of-the-box way to check tables for duplicate entries or a ready-made method to inspect column contents, such as for instance R’s
In this post I shall show a shell script replete with functions to do exactly that: count duplicates and show basic column statistics.
These two functions are ideal when you want to perform a quick sanity check on the data stored in or accessible with Hive.
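To give a flavour of what such a sanity check boils down to, here is a minimal sketch in Python. It is not the shell script from the post: it uses an in-memory SQLite database as a stand-in for Hive, and the table `events` and its columns are hypothetical. The two queries (group by all columns to find duplicates, and a handful of aggregates per column) are the generic SQL idea, which should carry over to HiveQL with little change.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; `events`, `id`, and `label` are made-up names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "b"), (3, None), (2, "b")],
)

# Duplicates: group by every column and keep the groups that occur more than once.
duplicates = conn.execute(
    "SELECT id, label, COUNT(*) AS n "
    "FROM events GROUP BY id, label HAVING COUNT(*) > 1"
).fetchall()

# Basic column statistics: rows, non-null values, distinct values, minimum, maximum.
stats = conn.execute(
    "SELECT COUNT(*), COUNT(label), COUNT(DISTINCT label), MIN(label), MAX(label) "
    "FROM events"
).fetchone()

print(duplicates)
print(stats)
```

The difference between the first two statistics is the number of NULLs in the column, which is often the first thing worth knowing about a freshly loaded table.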
Interaction with HDFS via the file system shell commands or YARN’s commands is cumbersome. I have collected several helpful functions in a shell script to make life with Hadoop and YARN a tad more bearable. Here I’ll go through the salient bits.
MemSQL is a distributed in-memory database that is based on MySQL. As of the latest version (5.7), MemSQL does not automatically collect table (i.e. all columns) and column (i.e. range) statistics. These statistics are important to the query optimizer. Here I’ll present a lightweight shell script that collects table and column (i.e. range) statistics based on a configuration file.
As technology evolves at a rapid rate, it may sometimes be difficult to keep up. While I am certainly not the world’s leading expert on anything, I thought I’d share how I keep abreast of the latest developments in the industry.
Unless you have a cluster with Apache Spark installed on it at your disposal, you may want to play a bit with Spark on your own machine. The standard VMs or docker images (e.g. Cloudera, Hortonworks, IBM, MapR, Oracle) do not offer the latest and greatest. If you really want the bleeding edge of Spark, you have to install it locally yourself, roll your own Docker container, or simply use sbt.
Continuous integration, Docker, Jenkins, Vagrant, DevOps, PaaS, serverless… If you are sometimes confused as to what the latest buzzwords in the tech industry mean, you have come to the right place.
Recently I had a conversation about the value proposition of visual programming languages, especially in data engineering. Drag-and-drop ETL tools have been around for decades and so have visual tools for data modelling. Most arguments in favour of or against visual programming languages come down to personal preference. In my opinion there are, however, three fairly objective reasons why visual tools are a bad idea in data engineering.
Apache Phoenix is a SQL skin for HBase, a distributed key-value store.
Phoenix offers two flavours of
but from that it may not be obvious how to update non-key columns, as you need the whole key to add or modify records.
When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same.
All that really changes is the main entry point, that is the fully qualified class.
Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.
Time is important when thinking about what happens when executing operations on Spark’s
The documentation on actions is quite clear, but it doesn’t hurt to look at a very simple example that may be somewhat counter-intuitive unless you are already familiar with transformations, actions, and laziness.
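The gist of laziness can be sketched without Spark at all. The snippet below is a plain-Python analogy, not the Spark API: the built-in `map` plays the role of a lazy transformation, `list` plays the role of an action, and `slow_square` and `log` are hypothetical names introduced purely for illustration.

```python
log = []  # records when the work is actually done


def slow_square(x: int) -> int:
    """Pretend to be an expensive function and note each invocation."""
    log.append(x)
    return x * x


# 'Transformation': map only describes the computation, nothing runs yet.
squared = map(slow_square, range(5))
calls_before = len(log)

# 'Action': materializing the result forces the computation.
result = list(squared)
calls_after = len(log)

print(calls_before, calls_after, result)
```

Any timing wrapped around the `map` line alone would suggest the computation is free, because all the work happens when the result is materialized.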
There are many technologies for streaming data: simple event processors, stream processors, and complex event processors. Even within the open-source community there is a bewildering number of options with sometimes few major differences that are not well documented or easy to find. That’s why I’ve decided to create an overview of Apache streaming technologies, including Flume, NiFi, Gearpump, Apex, Kafka Streams, Spark Streaming, Storm (and Trident), Flink, Samza, Ignite, and Beam.
A common, native way to load data into tables in Oracle is to create a view to load the data from.
Depending on how the view is built, you can either refresh (i.e. overwrite) the data in a table or append fresh data to the table.
Here, I present a simple package ETL that only requires you to maintain a configuration table and obviously the source views (or tables) and target tables.
It has – perhaps somewhat prematurely – been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay.
The available literature, the majority of courses in both the virtual and real world, and the media all project the image of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day.
The reality for many in the field is quite different. Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity. Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world, and it’s time to make the case for industrial data science.
Apache Spark is a popular framework for distributed computing, both within and without the Hadoop ecosystem. Spark offers interactive shells for Scala as well as Python. Applications can be written in any language for which there is an API: Scala, Python, Java, or R. Since it can be daunting to set up your environment to begin developing applications, I have created a presentation that gets you up and running with Spark, Scala, sbt, and ScalaTest in (almost) no time.
Apache Hadoop is the de facto standard in Big Data platforms. It’s open source, it’s free, and its ecosystem is gargantuan. When dumping data into Hadoop, the question often arises which container and which serialization format to use. In this post I provide a summary of the various file and serialization formats.
Julia is a fairly new and promising programming language that is designed for technical computing. Here I present a cheat sheet, or rather cheat page, with the salient features.
Interested? Head on over to databaseline.bitbucket.io/julia.html.
The canonical use cases of a graph database such as Neo4j are social networks. Networks also arise naturally in logistics and manufacturing. In particular, supply chains and value streams spring to mind. They may not be as large as Facebook’s social graph of all its users, but seeing them for the beasts they truly are can be beneficial. In this post I therefore want to talk about how you can model a value stream in Neo4j and how you can extract valuable information from it.
Although the date arithmetic in Oracle Database is well documented, it is not always as clear as it could be. In this blog post I want to point out a few common traps with regard to date calculations in Oracle that you should be aware of, especially with regard to intervals.
In the previous post I talked about the details of the data set for the Kaggle challenge to build a model that predicts which New York Times blog articles become popular. In this post I shall discuss how I went about discovering and adding features, and building a predictive model that landed me in the top 15% of all entries.
I completed the MITx course The Analytics Edge on edX, which I can wholeheartedly recommend to anyone who is interested in analytics. As a part of the MOOC, there was a competition on Kaggle to build a predictive model to answer the question, ‘What makes a New York Times blog article popular?’
Although the pre-built Oracle Database 12c VMs come with Oracle SQL Developer and APEX, you may not want to leave the host environment and develop in the virtual machine (guest). Sure, you can set up a shared folder and enable bi-directional copy-paste functionality thanks to the so-called Guest Additions, but it’s not the same as working in your own host OS.
In this post I describe how you can connect from the host to the guest on which the VM resides with a few simple tweaks. I have also included a simple installation overview of SQL Developer for Ubuntu.
In large databases it can be a challenge to have data type consistency across many tables and views, especially since SQL does not understand PL/SQL’s
When designing the overall structure of the tables, tools such as SQL Developer’s Data Modeller can be used to reduce the pain associated with potential data type inconsistencies.
However, as databases grow and evolve, data types may diverge and cause headaches when moving data back and forth.
Here I present a utility to identify and automatically fix many of these issues.
Collections are core components in Oracle PL/SQL programs.
You can (temporarily) store data from the database or local variables in collections and pass these collections to subprograms.
Collections are also critical to bulk operations, such as BULK COLLECT and FORALL, as well as table functions, both simple and pipelined.
Bulk operations and table functions are critical to high-performance code.
Here, I provide an overview of their characteristics as an introduction for novice PL/SQL developers and as a one-stop reference.
The way a company looks at its data is indicative of its readiness to embrace a data governance programme: is data a by-product of doing business or an asset that requires attention and resources? One of the key questions with data governance is, ‘Why?’
Why should you govern your data? What’s the benefit?
Coding standards are important because they reduce the cost of maintenance. To enable database developers on the same team to read one another’s code more easily, and to have consistency in the code produced and to be maintained, I have prepared a set of coding conventions for Oracle SQL and PL/SQL. These are by no means the be-all and end-all of Oracle Database standards, and in some instances you may not agree with the conventions I have proposed. That’s why I have created an easy-to-share, easy-to-edit Markdown document with these guidelines, including a snazzy CSS3 style sheet, in my Bitbucket repository. You can adapt these guidelines for your organization’s needs as you see fit; an attribution would be grand but I won’t sue you if you’re dishonest.
In almost all areas of software development, unit testing is not only common sense but also common practice. After all, hardly any serious software vendor would dare ship applications without having properly tested their functionality. When it comes to databases, many organizations still live in the Dark Ages. With Oracle SQL Developer there is absolutely no reason to remain in the dark: unit testing PL/SQL components is easy, free, and fully integrated into the IDE.
When it comes to SQL statements and optimizing queries on relational databases, probably the first thing developers (ought to) look at is the execution plan. The execution plan shows you what the database engine thinks is the best way to execute a query and it gives estimates of relevant runtime indicators that influenced the optimizer’s decision.
When a query involves calls to remote databases you may not always get the best execution (plan) available, because Oracle always runs the query on the local database as it has no way of estimating the cost of network traffic and thus no way of weighing the pros and cons of running your query remotely versus locally. Many tips and tricks have been noted by gurus and of course Oracle, but I was recently asked to tune a query that involved more than the textbook cases typically shown online.
Plentiful are the companies that revel in fancy descriptions of data-driven decision-making cultures on their corporate websites. Scarce are they who actually have a data governance office to back up these grand claims, for any data-centred programme without clear definitions and business processes regarding data is doomed to fail.
What is less known about data governance is that there is a phase during which companies run a risk of losing their best and brightest because of inaction or even worse: wrong actions. Michael Lopp has written an excellent article on why bored engineers quit, and as bad as that may be in general situations it is disastrous in the early phases of a data governance programme.
Databases and especially data warehouses typically consist of many dozens of tables and views. Good documentation is essential but even the best documentation cannot answer your questions as quickly as you want the information.
When I was rummaging through my digital attic I found code that I had worked on a few years ago. It is not related to data or database technologies but it is in itself interesting, so that I thought I’d better share it.
To calculate integrals in mathematics or solve differential equations in physics or chemistry you often need a computer’s help. Only in very rare cases can you express the integral symbolically. In most instances you have to be satisfied with a numerical solution, which for many problems is perfectly fine.
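As a reminder of what a numerical solution looks like in practice, here is a minimal sketch of composite Simpson's rule in Python. It is a generic textbook method chosen for illustration, not the code referred to in the post, and the function name `simpson` and its parameters are my own.

```python
import math
from typing import Callable


def simpson(f: Callable[[float], float], a: float, b: float, n: int = 100) -> float:
    """Approximate the integral of f over [a, b] with composite Simpson's rule.

    n is the number of sub-intervals and must be a positive even number.
    """
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even number")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd index) and 2 (even index).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# The integral of sin(x) from 0 to pi is exactly 2.
approx = simpson(math.sin, 0.0, math.pi)
print(approx)
```

Simpson's rule is exact for polynomials up to degree three, and for smooth integrands the error shrinks with the fourth power of the step size, which is why a modest number of sub-intervals is often perfectly fine.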
In the first part I have shown several options to calculate aggregation functions along various branches of a hierarchy, where the values of a particular row depend on the values of its predecessors, specifically the current row’s value is the product of the values of its predecessors.
Performance was the elephant in the room. So, it’s time to take a look at the relative performance of the various alternatives presented.
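For readers who have not seen the first part, the kind of calculation in question can be sketched in a few lines of Python. This is only an illustration of the problem, not one of the SQL alternatives from the posts: the hierarchy, the node names, and the factors are hypothetical, and I assume that a row's own factor is included in the product along its branch.

```python
from functools import lru_cache

# Hypothetical hierarchy: node -> (parent, factor); the root has no parent.
HIERARCHY = {
    "root": (None, 2.0),
    "a": ("root", 0.5),
    "b": ("root", 3.0),
    "c": ("a", 4.0),
}


@lru_cache(maxsize=None)
def path_product(node: str) -> float:
    """Multiply the factor of a node with the factors of all its ancestors."""
    parent, factor = HIERARCHY[node]
    if parent is None:
        return factor
    # Memoization ensures that shared ancestors are computed only once.
    return factor * path_product(parent)


products = {node: path_product(node) for node in HIERARCHY}
print(products)
```

The memoized recursion visits each node once, which is the behaviour one hopes for from a recursive query as well; how close the database gets to that is precisely the performance question.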
Hierarchical and recursive queries pop up from time to time in the life of a database developer. Complex hierarchies are often best handled by databases that are dedicated to such structures, such as Neo4j, a popular graph database. Most relational database management systems can generally deal with reasonable amounts of hierarchical data too. The classical example of hierarchical queries in SQL is the employees table: construct the organization chart with the CEO sitting at the top of the tree and all employees dangling from the branches on which their respective managers have placed their bottoms. While a directed acyclic graph of all company talent may tickle your fancy, I very much doubt it.
A while ago I had the pleasure of visiting Strasbourg, the capital of the Alsace region and home of the European Parliament. Strasbourg is a lovely city to spend a few days, go shopping, enjoy French and Alsatian cuisine, and take in the scenery, most notably the picturesque Petite France area with its half-timbered houses and restaurants serving local delicacies along the river Ill.
The story I want to tell has very little to do with Strasbourg though. It is about homelessness, statistics on homelessness, and more to the point: the lack of action by our governments to solve the problem of homelessness.
When it comes to data quality, many companies are still stuck in firefighting mode: whenever a problem arises, a quick fix is introduced, often without proper documentation, after which the problem is filed under ‘H’ for “hopefully not my problem the next time around”. The patches are often temporary, haphazard, and cause more problems downstream than they really solve. In most instances there is no obvious reason for it, which makes a bad situation worse. It reminds me of the saying that the treatment is worse than the disease, so it’s time to bring in the doctor…
Let’s talk about my goings-on over at Read The Docs (RTD).
Every time I hear that buzz phrase whizzing around I have the inexplicable urge to smack it back into the hive.
It was of course a public relations gimmick by Goldman Sachs and PricewaterhouseCoopers to ‘predict’ the outcome of the 2014 FIFA World Cup. Their best and brightest looked into their crystal footballs at historical data and created statistical models that foresaw what has recently transpired.
Let’s take a look at what they claimed before the World Cup and what really happened in Brazil in order to answer the question: Were they right?
The fourth and final part of the series on the challenges of data integration is about data governance. Data governance is not so much a challenge as it is a critical component of continued success. Data integration is usually only the Band-Aid that is applied to a particular business problem. Beyond that it has the power to transform a business, but to do so you need to continuously guard, monitor, analyse, and improve your data and related business processes, so that the information you glean from it is always sound.
The third part of the four-part series on the challenges of data integration deals with people. I have already hinted at a few people issues in the first and second parts on technical and project management challenges, respectively, but I have not gone into specifics.
Where people work together there will be conflicts. As we shall see, data integration projects can be particularly tricky, as they require dirty data to be ‘smuggled’ over organizational borders into enemy territory.
In this second post of a four-part series on the challenges of data integration I want to talk about project management. Data integration is, as I have said before, not simply a matter of throwing technical people at a business problem. Not literally of course: most people do not like being flung at things, abstract or concrete, but probably at the latter a bit less than at the former.
Project management is the key to your success. Sure, you need able people to build the data warehouse, but without a solid foundation in project management your project will tip over at the slightest sigh. And please take it from someone who has been there, done that, got the T-shirt, and has outgrown it: there will be a lot of sighs during the project, even full-blown tornadoes… To weather any storm, you and the entire organization have to live project management practices. Project management is not the silver bullet, but it can protect you against the most common enemies: no idea, no plan, no back-up plan, and no support.
Data integration is a formidable challenge. For one, data integration is never the goal of an organization. Similarly, a data warehouse is never the objective. It is merely a vehicle that can drive you to your destination. Data storage and integration for data’s sake are a waste of time, money, resources, and nerves. Without a clear business case, effective leadership, and strong support, including but not limited to a highly visible and respected sponsor, any data integration project is doomed from the get-go.