In this article, I would like to explain my journey in developing a model for automatic harmonic analysis. Personally, I am interested in understanding music deeply. Questions like: “Why are things structured the way they are?” and “What was the composer or artist thinking when writing the piece?” are important to me. Naturally, the place to start for me was analysing the underlying harmony of a piece.
Scavenging my old notebooks from back in the conservatory, I stumbled upon the technique we used to annotate and analyze small musical excerpts: Roman Numeral analysis. The idea might seem a bit complicated if you have never heard of it before, but please bear with me.
My goal is to build a system that can automatically analyze musical scores. Given a score, the system returns the same score with an extra staff containing the chords in Roman numeral notation. This should work mainly for classical tonal music, but it is not necessarily limited to that.
In the rest of this article, I will introduce the concepts of Roman Numeral analysis and Graph Neural Networks, and then discuss some details about the model I developed and its results. I hope you enjoy!
Introduction to Roman Numerals
Roman Numeral analysis is a method used to understand and analyze the chords and harmonic progressions in music, particularly in Western classical music and popular music. Chords are represented using Roman numerals instead of traditional musical notation.
In Roman Numeral analysis, each chord is assigned a Roman numeral based on its position and function within a given key. The numerals correspond to the scale degrees of the key, with uppercase numerals representing major chords and lowercase numerals representing minor chords.
For example, in the key of C major, the C major chord would be represented by the Roman numeral “I” (uppercase “I” denotes a major chord). The D minor chord would be represented by “ii” (lowercase “ii” denotes a minor chord). The G major chord would be represented by “V” (uppercase “V” denotes a major chord) because it is the fifth chord in the key of C major.
Roman numerals are always relative to a key: if the key is C major, the Roman numeral “V” denotes the dominant, i.e. the G major chord. Chords also come in different qualities, such as major or minor, and as noted above, capital letters stand for major quality and lowercase for minor quality.
In music analysis, the lowest note usually serves as a point of reference for the character of a chord, and Roman numerals can convey this information too. In the example above, the bass (the lowest chord note) of the second chord is F sharp, but the root of the chord is D; the chord is therefore in first inversion, indicated with the figure 6.
Another interesting notational capability of Roman numerals concerns borrowed chords, an effect called secondary degrees. Implicitly, every (primary) Roman numeral has the tonic (i.e. I or i) as its secondary degree; when a secondary degree is annotated explicitly, it tells us which scale degree is momentarily acting as the tonic. The third chord in the example above has a dominant seventh as its primary degree and the dominant of C major as its secondary degree. The V65 indicates a major chord with a seventh quality, in first inversion.
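If you want to experiment with these concepts yourself, the music21 Python library (my suggestion here, not something the rest of this article depends on) can derive Roman numeral figures from chords and keys:

```python
from music21 import chord, key, roman

c_major = key.Key('C')

# D minor triad with F in the bass: second scale degree, first inversion.
dm6 = chord.Chord(['F3', 'A3', 'D4'])
print(roman.romanNumeralFromChord(dm6, c_major).figure)  # ii6

# G dominant seventh with B in the bass: dominant seventh, first inversion.
v65 = chord.Chord(['B3', 'D4', 'F4', 'G4'])
print(roman.romanNumeralFromChord(v65, c_major).figure)  # V65
```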
Roman Numeral analysis helps musicians and music theorists understand the structure and relationships between chords in a piece of music. It allows them to identify common chord progressions, analyze harmonic patterns, and make comparisons between different musical compositions. It is a useful tool for composers, arrangers, and performers to understand the underlying harmony and make musical decisions based on that knowledge.
Automatic Roman Numeral Analysis
Now that we have a basis for what Roman Numeral analysis looks like in practice, we can discuss how to automate it. In this article, we will cover a method to predict Roman Numerals from symbolic music, i.e. digital scores (MusicXML, MIDI, MEI, Kern, MuseScore, etc.). Note that you can obtain some of these formats from any score editor software, such as Finale, Sibelius, or MuseScore; such software usually allows exporting to the (uncompressed) MusicXML format. If you don’t have any of these editors, I suggest using MuseScore.
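As a concrete starting point, here is how you might load such a score in Python with the partitura library (my choice for this sketch; the file name is a placeholder):

```python
import partitura as pt

# Load an uncompressed MusicXML file exported from a score editor.
score = pt.load_score("my_score.musicxml")

# Each note event carries onset, duration, and pitch information,
# exposed here as a structured numpy array.
notes = score.note_array()
print(notes[["onset_beat", "duration_beat", "pitch"]][:5])
```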
Let’s now discuss these representations in more depth. In contrast to audio representations, where music can be seen as a digital sequence at the waveform level or a 2-D spectrogram in the frequency domain, a symbolic representation contains individual note events carrying information such as onset time, duration, and pitch spelling (the names of notes). Symbolic representations have often been treated as pseudo-audio by separating the score into quantized time frames, for example a pianoroll (like the figure shown below). Recently, however, some works have proposed a graph representation of the score, where every note is a vertex and edges represent relations between notes. Scores can be transformed into this graph structure, which is particularly useful when a Machine Learning model is involved.
So, given a symbolic score, the graph is constructed by modelling three relationships between notes (see the sketch after this list):
- Notes starting at the same time, i.e. same onset.
- Notes starting when another note ends, i.e. consecutive notes.
- Notes starting while another note is sounding, i.e. a during connection.
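A minimal sketch of this construction, assuming notes are given as parallel onset/duration arrays in beats (the actual preprocessing in the paper may differ in its details):

```python
import numpy as np

def build_score_graph(onset, duration):
    """Return typed edges for the three note relationships above."""
    offset = onset + duration
    edges = []
    for i in range(len(onset)):
        for j in range(len(onset)):
            if i == j:
                continue
            if onset[i] == onset[j]:                 # same onset
                edges.append((i, j, "onset"))
            elif offset[i] == onset[j]:              # consecutive
                edges.append((i, j, "consecutive"))
            elif onset[i] < onset[j] < offset[i]:    # during
                edges.append((i, j, "during"))
    return edges

# Toy example: note 1 starts as note 0 ends; note 2 sounds during note 1.
onset = np.array([0.0, 1.0, 1.5])
duration = np.array([1.0, 1.0, 0.5])
print(build_score_graph(onset, duration))
```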
The graph of the score can be used as input to a Graph Neural Network which implicitly learns by propagating the information along the edges of the graph. But before we explain how a model works on scores, let’s first briefly explain how Graph Neural Networks work.
So, what exactly are Graph Neural Networks? At their core, GNNs are a class of deep learning models designed to handle data represented as graphs. Just like real-world networks, graphs consist of interconnected nodes or vertices, each with its own unique features. GNNs leverage this interconnectedness to capture rich relationships and dependencies, enabling them to perform analysis and prediction tasks.
But how do GNNs work? Imagine a musical score where each note is a node, and note relations represent the connections between them. Traditional models would treat each note instance individually, ignoring the musical context. GNNs, however, embrace this context by simultaneously considering each note’s individual features (e.g., pitch spelling, duration) and its relationships (same onset, consecutive, etc.). By aggregating information from neighbouring nodes, GNNs empower us to understand not only individual notes but also the dynamics and patterns within the entire network.
To achieve this, GNNs employ a series of iterative message-passing steps. During each step, nodes gather information from their neighbours, update their own representations, and propagate these updated features further through the network. This iterative process allows GNNs to capture and refine information from nearby nodes, gradually building a comprehensive understanding of the entire graph.
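To make the idea concrete, here is a deliberately bare-bones message-passing step in numpy; real GNN layers add learned weight matrices and nonlinearities:

```python
import numpy as np

def message_passing_step(x, edges):
    """Each node averages its neighbours' features and mixes the result
    with its own representation (a simplified, weightless update)."""
    agg = np.zeros_like(x)
    deg = np.zeros(len(x))
    for src, dst in edges:
        agg[dst] += x[src]      # gather messages along edges
        deg[dst] += 1
    deg[deg == 0] = 1           # isolated nodes keep their own features
    return 0.5 * x + 0.5 * agg / deg[:, None]

x = np.eye(3)                        # toy one-hot node features
edges = [(0, 1), (1, 2), (2, 0)]     # a directed 3-cycle
print(message_passing_step(x, edges))
```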
The message-passing process, when applied iteratively across the network, is sometimes called graph convolution. A popular graph convolution block, which we also used in our music analysis model, is SageConv from the well-known GraphSAGE paper. We won’t cover the particulars here, but many sources cover the functionality of GraphSAGE in depth.
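For reference, this is roughly what applying a single SageConv block looks like with PyTorch Geometric (an assumption on tooling; the dimensions here are arbitrary):

```python
import torch
from torch_geometric.nn import SAGEConv

conv = SAGEConv(in_channels=8, out_channels=16)

x = torch.randn(4, 8)                     # 4 nodes with 8 features each
edge_index = torch.tensor([[0, 1, 2, 3],  # source nodes
                           [1, 2, 3, 0]]) # target nodes
out = conv(x, edge_index)
print(out.shape)  # torch.Size([4, 16])
```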
The beauty of GNNs lies in their ability to extract meaningful representations from graph data. By learning from the local context and combining it with global information, GNNs can uncover hidden patterns, make accurate predictions, and even generate new insights. This makes them invaluable in a wide range of domains, from social network analysis to drug discovery, traffic prediction to fraud detection, and now to music analysis.
The model used for Roman Numeral analysis is called ChordGNN.
As the name suggests, ChordGNN is a model for automatic Roman Numeral analysis based on Graph Neural Networks. A particularity of this model is that it leverages note-wise information but produces onset-wise predictions, i.e. a Roman Numeral is predicted for each unique onset event in the score. This means that multiple notes sharing the same onset receive the same Roman Numeral, just as when annotating a musical score by hand. Nevertheless, through graph convolution, information from every note is propagated to the neighbouring notes and onsets.
ChordGNN is based on a Graph Convolutional Recurrent Neural Network architecture, composed of stacked GraphSAGE convolutional blocks that operate at the note level.
The Graph Convolution is followed by an Onset-Pooling Layer that contracts the note representations to the onset level, thus resulting in a vector embedding for each unique onset of the score. This is an important step as it moves the representation from a graph to a sequence.
The embeddings obtained from the Onset-Pooling, which are ordered by time, are then fed to a sequential model such as a GRU stack. Finally, a simple Multi-Layer Perceptron classifier is added for each of the attributes that describe a Roman Numeral; ChordGNN is therefore also a Multi-Task model.
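Putting the pieces together, a structural sketch of this pipeline could look as follows in PyTorch (layer sizes, task names, and class counts are illustrative placeholders, not the published configuration):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class ChordGNNSketch(nn.Module):
    """Note-level graph convolutions -> onset pooling -> GRU -> task heads."""

    def __init__(self, in_dim=16, hidden=64, tasks=None):
        super().__init__()
        tasks = tasks or {"degree": 7, "quality": 10, "inversion": 4}
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, n_classes) for name, n_classes in tasks.items()}
        )

    def forward(self, x, edge_index, onset_index):
        # Stacked GraphSAGE blocks over the note graph.
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        # Onset pooling: average all note embeddings that share an onset.
        n_onsets = int(onset_index.max()) + 1
        pooled = torch.zeros(n_onsets, h.size(1)).index_add_(0, onset_index, h)
        counts = torch.bincount(onset_index, minlength=n_onsets).clamp(min=1)
        pooled = pooled / counts.unsqueeze(1)
        # Sequence model over time-ordered onsets, then one classifier per task.
        seq, _ = self.gru(pooled.unsqueeze(0))
        return {name: head(seq.squeeze(0)) for name, head in self.heads.items()}
```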
ChordGNN does not directly predict the Roman numeral for every position of the score; rather, it predicts the degree, local key, quality, inversion, and root. The predictions of the individual attribute tasks are then combined into a single Roman Numeral label.
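As a toy illustration of this combination step (the mapping below is hypothetical and much simpler than the actual post-processing):

```python
# Figured-bass suffixes for triads and seventh chords by inversion.
TRIAD_FIGURES = {0: "", 1: "6", 2: "64"}
SEVENTH_FIGURES = {0: "7", 1: "65", 2: "43", 3: "2"}

def to_roman_numeral(degree, quality, inversion, local_key):
    """Combine per-task predictions into a single Roman Numeral string."""
    is_major = quality in ("M", "Mm7")          # toy quality vocabulary
    numeral = degree.upper() if is_major else degree.lower()
    figures = SEVENTH_FIGURES if quality.endswith("7") else TRIAD_FIGURES
    return f"{local_key}: {numeral}{figures[inversion]}"

# Fifth degree, dominant-seventh quality, first inversion, in C major.
print(to_roman_numeral("V", "Mm7", 1, "C"))  # C: V65
```

Let’s now see what the output predictions look like.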
In this section, we will look at some of ChordGNN’s predictions and even compare them with an analysis done by a human. Below is an example of the first bars from Haydn’s string quartet op.20 №3 movement 4.
In this example, we can observe several things. In measure 2, the human annotation marks a tonic in first inversion; however, the viola at that point is lower than the cello, so the chord is actually in root position. ChordGNN predicts this correctly. Subsequently, ChordGNN predicts a harmonic rhythm of eighth notes, which disagrees with the annotator’s half-note marking. Analyzing the underlying harmony of that passage, we can justify ChordGNN’s choices.
The human annotation suggests that the entire second half of the 2nd measure represents a viio chord. However, it should not be in the first inversion, as the cello plays an F# as the lowest note (which is the root of viio). However, there are two conflicting interpretations of the segment. First, the viio on the third beat is seen as a passing chord between the surrounding tonic chords, leading to a dominant chord in the next measure. Alternatively, the viio could already be part of a prolonged dominant harmony (with passing chords on the offbeats) leading to the V7. The ChordGNN solution accommodates both interpretations as it doesn’t attempt to group chords at a higher level, treating each eighth note as an individual chord rather than a passing event.
Above is another example comparing the predictions of ChordGNN with the original analysis of a Mozart Piano Sonata. In this case, ChordGNN’s analysis is a bit more simplistic, choosing to omit some chords. This happens on two occasions with the dominant seventh in third inversion (V2). This is a reasonable choice for ChordGNN, since the bass note is missing. Another disagreement between the annotation and the prediction occurs at the half cadence towards the end: ChordGNN treats the C# of the melody as a passing note, where the annotator chooses to specify the extension #11.
In this article, we discussed a new method for automating Roman Numeral Analysis using Graph Neural Networks. We discussed how the ChordGNN model works and showcased some of its predictions.
E. Karystinaios, G. Widmer. “Roman Numeral Analysis with Graph Neural Networks: Onset-wise Predictions from Note-wise Features.” Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2023.
All images and graphics in this article are created by the author.