Resources on fine-tuning the model & local execution #12

Open
mightimatti opened this issue Jan 3, 2024 · 15 comments

Comments

@mightimatti

Hi,
I came across this repository and have played around with the notebooks a little bit, and I succeeded in running the model locally to perform inpainting on MIDI files of mine.
I was wondering whether there are any resources on how one would go about fine-tuning the model with some more training (I would like to see if I can imitate the style of a specific composer such as J.S. Bach) using only Bach MIDI files. I don't quite understand how I need to preprocess the MIDI files to this end. Is there a training script available that lends itself to local training?

Further, I was wondering whether you could outline how to generate MIDI using ONLY the prior embedded within the model (no additional seed MIDI at inference time). Is this possible?

Thank you for the excellent model!

@asigalov61
Owner

@mightimatti Hi

Thank you for writing and appreciating my work. It means a lot to me :)

Unfortunately, there are no local (CLI Python) training or MIDI processing scripts just yet. However, there are Colab notebooks that will allow you to process MIDIs if you want to fine-tune.

https://github.com/asigalov61/Allegro-Music-Transformer/tree/main/Training-Data

In this directory you will find the MIDI processing Colab and an auto-generated Python script that you can use to process files locally.

However, Allegro does not have a stand-alone fine-tuning script, which is why I wanted to invite you to check out my Giant Music Transformer model/implementation. It is much better suited for fine-tuning, and it has a stand-alone fine-tune Colab/code ready to be used.

https://github.com/asigalov61/Giant-Music-Transformer

https://github.com/asigalov61/Giant-Music-Transformer/tree/main/Fine-Tune

In regard to generating without seeds, you can find such generator code for both Allegro and Giant in the Original Version of the generator Colab under Improv Generation:

https://colab.research.google.com/github/asigalov61/Allegro-Music-Transformer/blob/main/Allegro_Music_Transformer.ipynb

https://colab.research.google.com/github/asigalov61/Giant-Music-Transformer/blob/main/Giant_Music_Transformer.ipynb

Giant Music Transformer also has bulk improv and continuation generation, which you can use to generate from scratch:
https://colab.research.google.com/github/asigalov61/Giant-Music-Transformer/blob/main/Giant_Music_Transformer_Bulk_Generator.ipynb

Hope this answers your questions and I hope it is not too confusing since there is a lot :)

But if you need more help, especially with fine-tuning, feel free to ask :)

Alex

@mightimatti
Author

mightimatti commented Jan 3, 2024

Hi Alex,
thank you for the quick reply. After logging into Colab and running both notebooks, I have come to the conclusion that the Allegro Music Transformer seems a lot more promising for what I am trying to do.
Looking at the code in Allegro-Music-Transformer/Training-Code, it would appear that the dataset is expected to have the same format as what the Giant Music Transformer expects.
Can I use the dataset preprocessor of the Giant Music Transformer to preprocess data I intend to then feed into the Allegro-Music-Transformer notebook? Or is the format different?

I apologize for the many questions, but I am happy to share my code once I successfully train my model, as this might be useful to others who wish to train it. The reason I would like to train a local instance is to allow me to train on a cloud instance from Vast.ai.

@asigalov61
Owner

@mightimatti Np. I am happy to help :)

While somewhat similar, the Allegro data format is still a bit different from the Giant's, so you have to use the Allegro MIDI processor if you want your data to be compatible with Allegro.

The Allegro MIDI processor code is located here, but it will need some clean-up since it was auto-generated by Google Colab:
https://github.com/asigalov61/Allegro-Music-Transformer/blob/main/Training-Data/allegro_music_transformer_training_dataset_maker.py

So use this to process your MIDIs, and then use the Allegro training code to train or fine-tune:
https://github.com/asigalov61/Allegro-Music-Transformer/blob/main/Training-Code/allegro_music_transformer_maker.py

Other than that, I would be happy to add your local training code version to the repo if you make a PR once you do it :)

I really apologize for not having proper local versions of the MIDI processor code and training code. The reason is that it is more practical and convenient to have them as Colabs, since many people do not have local GPUs for music AI.

I also wanted to suggest you check out Lambda Labs (https://lambdalabs.com/) for training and inference. It has very good prices, and it is fully compatible with Jupyter/Google Colab notebooks, which makes it easy to do all of this.

Hope this helps.

Alex

@mightimatti
Author

Thank you again.
I am currently working my way through the errors generated while running the automatically generated scripts.

I'll let you know how it progresses :-)

Is there a specific reason why you prefer git cloning TMIDIX as opposed to just installing it via pip install git+https:.... or via regular pip? Is this a more up-to-date version?

When I last tried using Lambda, about a year ago, I couldn't get it to recognize my European credit card, so I never got to use it. It does look good, though.

@asigalov61
Owner

@mightimatti Yes, those auto-generated scripts need some clean-up

I clone TMIDIX/tegridy-tools because it is the most up-to-date version; that is correct. I am not very handy with PyPI/pip packages, so I never really get around to updating the package. So git cloning is the best way to get the latest version.

I see about Lambda.

Well, anyway, let me know if you need more assistance. I am here to help. And feel free to PR the local implementation if/when you finish it, because I think others may find it useful as well :)

Alex

@mightimatti
Author

Hi Alex,
thank you for being so available and helpful so far, and thanks again for the excellent model.

I was able to get things to run locally, and I am having mixed results with the model so far. I'll give you a little background on what I am trying to do, as I have hit a roadblock in moving forward with my use of the model and need a little guidance. We switched over to the Giant Music Transformer, as this is what you recommend:

My homework group and I have spent a substantial amount of time repurposing the Giant Music Transformer for the specific task of generating 4-voice choral compositions as part of an experimental AI course at university. We are trying to generate 4-channel MIDI compositions which are reminiscent of choir pieces. I succeeded in preparing approx. 8K MIDI files which contain 4 distinct channels, and am now trying to modify the dataset maker to transpose the files to C major (or A minor in the case of a minor mode), remap the channels to channels 0-3, and finally run the preprocessor as usual.

What I have come to realize is that the Giant Music Transformer, even in its basic (non-XL) form, is way larger (more parameters) than what I suspect I need, as all my music is in the same key and 4 monophonic channels are used, as opposed to 16 arbitrary (polyphonic) ones. I have had somewhat reasonable results training overnight with ~1M parameters on Johann Sebastian Bach's collected choral works (albeit with all instruments mapped to a single channel) and would now like to retrain, this time with 4 channels.

I am modifying the preprocessor to this end and came across the following questions I can't seem to clarify:

Where does the constant for the model parameter num_tokens come from? The padding index, in my understanding of seq2seq models, is based on the largest embedding index the model is likely to encounter, yet here it is a hard-coded value which does not depend on any preprocessor parameters fixing the embedding space. I see a lot of seemingly arbitrary integers in the same range (~22-23K) in the preprocessing file, more specifically in the "outro" section, which I suspect are related to this number. E.g.:

                        if ((comp_chords_len - chords_counter) == 50) and (delta_time != 0):
                            out_t = 22273+delta_time
                            out_p = 22785+ptc
                            melody_chords.extend([22913, out_t, out_p]) # outro seq

                        #=======================================================
                        # Bar counter seq

                        if (bar_time > pbar_time) and (delta_time != 0):
                            bar = 21249+min(1023, (bar_time-1)) # bar counter seq
                            bar_t = 22273+bar_time_local
                            bar_p = 22785+ptc
                            melody_chords.extend([bar, bar_t, bar_p])
                            chords_counter += 1
                            pbar_time = bar_time

                        else:
                            if delta_time != 0:
                                chords_counter += 1

Where is this number coming from, and what can I do to reduce the dimensionality of the embedding space or calculate the required size?
I have been experimenting with changing the model parameter to reduce the dimensionality and VRAM requirements and I would like to understand how the embedding space is determined.

Is there a chance you could give me a brief overview of the steps that the preprocessor goes through and briefly describe the representation of the training instances? Maybe a higher-level understanding will allow me to understand the rest on my own.

Btw I haven't forgotten about sharing my local execution code so others can use it, but I am waiting to finish my project and share the (polished) results.

Sorry for the many questions, and thank you for your availability so far.

@asigalov61
Owner

@mightimatti You are welcome :) Thank you for appreciating my work, and please know that I am always happy to help :)

To answer your question... I am sorta old-school when it comes to coding so my code can be hard to read, for which I apologize...

But basically, num_tokens was calculated by adding all encoding token ranges together + pad_idx + 1. So the number that was hard-coded in the original processor Colab is the exact number of tokens needed for all features of the implementation.

To give you a specific, exact breakdown: 512 (delta_start_time) + 4096 (duration_velocity) + 16641 (patch_pitch) + 1024 (bar_counter) + 512 (bar_time) + 128 (bar_pitch) + 1 (outro_token) + 2 (drums_present) + 129 (intro_seq_patch) + 1 (intro/SOS) + 1 (EOS) + 1 (pad_idx) == 23047

This is for the original/full-featured version. The pre-trained models were made with a simplified version of the same so that they can fit into consumer GPUs at a reasonable number of batches.

I hope this clarifies how the num_tokens was calculated.
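
As a sanity check, here is a minimal sketch (not the actual GMT source) that simply stacks the listed ranges in order; the running offsets it prints line up with the hard-coded constants in the dataset maker excerpt you quoted (21249 for the bar counter, 22273 for bar time, 22785 for bar pitch, 22913 for the outro token):

# Sketch only: recompute the cumulative token offsets implied by the
# breakdown above by stacking the ranges in the listed order.
ranges = [
    ("delta_start_time", 512),
    ("duration_velocity", 4096),
    ("patch_pitch", 16641),
    ("bar_counter", 1024),
    ("bar_time", 512),
    ("bar_pitch", 128),
    ("outro_token", 1),
    ("drums_present", 2),
    ("intro_seq_patch", 129),
    ("intro_sos", 1),
    ("eos", 1),
]

offset = 0
for name, size in ranges:
    print(f"{name:18s} starts at token {offset}")
    offset += size
# bar_counter starts at 21249, bar_time at 22273, bar_pitch at 22785 and
# outro_token at 22913 -- matching the constants in the processor code above.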

And yes, for your purposes you can probably make a smaller/stripped-down version of the GMT, since you only have 8K MIDIs and only 4-channel/short-generation requirements.

However, you should try fine-tuning too, since it may produce good results as well. I ran a few fine-tuning experiments with GMT, and the fine-tuned versions play really well if the generation seq_len is not too long.

Anyway, let me know if you need more help or if you need me to elaborate more on the GMT MIDI processor.

Alex.

@mightimatti
Author

Hi Alex,
thank you for the swift reply.

I ran training all night with my new dataset and the sequence length set to 4K, and came to realize that my 900K-parameter model is probably underfitting with this sequence length. At the moment I am not able to reduce the dimensionality of the embedding space, because I don't understand the parameters you mention and how I would go about reducing them. I don't understand how you determined, for example, that patch_pitch needs to be 16641.

  1. Could you maybe briefly outline how one would go about recalculating these values for a simplified representation of the dataset?

  2. If I know, that there are only 4 possible channels how do I take advantage of this simplification?

  3. What is the meaning of patches/"patch changes"? I suspect they are related to the instrument representation as well?

  4. If I know that the voices are mostly monophonic, is there a way of seizing on this?

  5. I already removed the augmentation that shifts by 1 semitone, as I wish to generate mostly in C major/A minor (mostly the "white keys" of the piano/no accidentals). Can I use this information?

I'm sorry if this seems obvious to you, but I really don't understand the intermediate representation/embedding and I believe this is key to understanding both the model and the synthesis of audio.

  6. I am further puzzled by how I can set the instruments that the channels are mapped to when synthesizing MIDI, as I don't quite understand MIDI very well (yet). I limited the pool to channels 0-3, but I find that the instruments that end up in the MIDI file (more specifically, their interpretation in MuseScore, the program I use to inspect them) change from output to output. Is there a way I can control this using TMIDIX?

Sorry for the many questions. FWIW I will try and write my code in such a way that maybe future users can benefit from my more generalized preprocessing script!

@asigalov61
Owner

@mightimatti Yes, I would be happy to help you here :) You are definitely going in the right direction with all this :)

First of all, you need to properly prep/process your MIDIs. From my experience, it is important to use a 1024 embed (this is a minimum) and also an appropriate seq_len which is close to the average length of your MIDIs when encoded the way you prefer.

A 4096 seq_len is a good length if you use triplet or quad encoding... I will explain and show below...

So I will begin by answering your questions...

  1. In GMT, the patch_pitch token range was determined by simple multiplication of the 128 MIDI patches by the 128 MIDI pitches. It's basically a simple stacking of values, i.e. (128 * 129) + 128. We want to use 129 here because patch 128 + the last 128 pitches need to be covered as well, so it is 129 * 129 == 16641 (see the sketch after this list). Now, this was done to create individual ranges for each patch, as they are different, i.e. different MIDI instruments play differently and therefore need to be separated into their own ranges/channels, so to speak.

  2. Now, since you want to use only 4 channels/4 patches (or maybe groups of patches), you do not need as many tokens for that as the original GMT uses. So what you need to do is stack only what you use, for the most efficient and ideal arrangement/encoding. So in your case it is going to be 128 * 5 = 640 tokens to encode 4 separate channels/groups of patches. This should work much better for your purposes.

  3. patch_changes in the GMT MIDI processor is simply a decoding of standard MIDI patch_change events into an individual per-note encoding. In the standard MIDI specification, notes are not labeled with patches, only channels. So this part of the GMT code labels each MIDI note with its patch for further easy processing.

  4. If one or more of your voices are monophonic, it is absolutely fantastic, because you can use it in many ways to improve training and also generation for controlled and nice output. Particularly, if all your MIDIs have mono melodies, you can definitely seize on that and use it as a control mechanism for generation.

  5. If your MIDIs/compositions are homogeneous and in a specific key(s), it is also very helpful for training and generation, so you can definitely incorporate it into your MIDI processing code.

  6. Now, for decoding back from the model output, there are a number of ways to do it easily. Basically, you either assign static patches to your channels, or you use MIDI patch_change events to dynamically control the same.
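
Here is a minimal sketch of the patch/pitch stacking idea from item 1 (illustrative only, not the exact GMT code), assuming patch values 0-128 (one extra slot, e.g. for drums) and pitch values 0-128:

def encode_patch_pitch(patch, pitch):
    # fold patch and pitch into a single token: 129 * 129 == 16641 possible values
    return patch * 129 + pitch

def decode_patch_pitch(token):
    # inverse of the stacking above -> (patch, pitch)
    return divmod(token, 129)

assert decode_patch_pitch(encode_patch_pitch(40, 60)) == (40, 60)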

Yes, TMIDIX is the best way to go about it, because it has everything you may need for your project, and I finally streamlined it in my latest update last month, so definitely try it out.

To help you out further, I wanted to invite you to use the shared Google Colab below so that I can show you how you can do your project and also help you get familiar with TMIDIX functions/approaches.

https://colab.research.google.com/drive/1XALSVLcnCqYvPiyv6ZWl99Q0a4gGIh3o?usp=sharing

This Colab uses my POP Melody Transformer implementation as its core code, but we will modify it for your needs and specs shortly, so do not worry if it says POP Melody Transformer.

So what I need from you to help you effectively is a few example MIDIs from your dataset. I hope you can share them here as a zip attachment so that I can take a look and adjust the Colab code appropriately for you.

Alex.

@mightimatti
Author

Hi Alex,
as always, thank you for your help.

I have read your answer multiple times and have come to realize that my misunderstanding of MIDI led me to use channels the way I should have used patches... I used them to encode different tracks/instruments. I think I understood quite a bit more about the GMT by going through the code for the past hour with the answers you gave me. The next step for me is making modifications to incorporate these insights into my version of the dataset maker, but this is complicated by the code being a bit of a mess, as I believe it's an automatic export from Colab, which I have already adapted to run locally on my computer.

I very much appreciate your offer to have a look at the code, but it is difficult for me to use the Colab you shared, as it seems to differ quite a bit from the GMT dataset maker, at least the version I am using, and I have already made significant modifications to the preprocessor to incorporate the change of key and to generally be able to run the script locally (the original intention of this thread). Maybe I can ask you a few more questions which would greatly help me:

  1. I can't follow your reasoning for why I need 640 entries for the pitches of 4 channels. Considering the first entries are stored with offset 0, I feel like 512 entries should be enough, e.g.

[0, 127], [128, 255], [256, 383], [384, 511]

The reason I ask is not because I am desperately trying to reduce the dimensionality, but because I feel I might not be following. Is this related to the extra patch for the drums? Or is there another reason why I need an extra 128 pitches/entries?

  2. What is the reason behind the division by 8 on the duration? Is this the octo-velocity you mention in the repo's description? What is the intuition behind this?

  3. Could you briefly break down the remaining dimensionality of the dataset? Is the sequence length/resolution of the timing encoding also implicit in the dimensionality of the embedding? It would be really useful to understand all the other dimensions. For instance, I don't quite understand the intuition behind the intro/outro. Is this to mark sequence limits, like a padding entry?

Thank you very much, and I apologize in case my questions aren't clear. It is very late over here in Berlin :-)

@asigalov61
Owner

@mightimatti Np, whatever works better for you :) I just thought that showing a simple Colab example might make things easier to understand, but if not, then let me just answer your questions the best I can... :)

  1. I think I might've misread your question about pitches and channels... my bad here... So if you have only 4 channels (0-3), you indeed only need 512 tokens to represent them: 128 * 4 = 512.

  2. In GMT, I combined duration and velocity into one token just like I did with pitches/patches, so the division by 8 is indeed to decode octo-velocity. You understand this correctly. I chose octo-velocity because the full MIDI velocity range is excessive and not very practical to encode, since models usually can't produce meaningful velocity if given the full MIDI range (128). If the range is smaller (8), then the model outputs better velocity predictions (see the sketch after this list).

  3. In regard to timings, the version of GMT that you quoted in the OP uses 512 tokens for delta start times and durations. However, this is excessive, so my production versions of GMT use only 256 tokens for delta start times and durations, which is roughly 64 tokens per 1 ms.
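
Here is one plausible way to fold a duration step and a velocity bucket into a single token, as a sketch of the idea from item 2 (not necessarily GMT's exact mapping); it assumes 512 duration steps and 8 velocity buckets, i.e. 4096 combined values:

def encode_dur_vel(dur_step, velocity):
    vel_bucket = min(velocity, 127) // 16      # "octo-velocity": 8 buckets, 0..7
    return min(dur_step, 511) * 8 + vel_bucket

def decode_dur_vel(token):
    dur_step, vel_bucket = divmod(token, 8)
    return dur_step, vel_bucket * 16 + 8       # approximate original velocity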

Now, intro and outro are simply aux tokens (seqs) to soft-condition the model for intros and outros. For example, I use the outro tokens to generate the outro (ending) of a composition when I use the model to do a continuation, as the model may not always generate the ending by itself. Same with the intro... I use it to soft-condition the model to generate some arbitrary beginning of a composition when I compose with the model.

What else do you want to know about the GMT dataset maker? Ask me specifically, so that it is easier for me to explain, since there are a lot of features in the dataset maker.

Hope this answers your current questions but feel free to ask more if needed :)

Alex.

PS. Check out the TMIDIX advanced_score_processor function. It is based on the GMT processor code, and it is very handy for working with my code/implementations. Please note that you may need to pull the latest TMIDIX from tegridy-tools, as I believe the GMT copy does not have the latest updates...

@mightimatti
Author

mightimatti commented Feb 4, 2024

Hi Alex,
I spent yesterday working on a modified version of the code that converts MIDI to the model's INT representation and back, based on user-provided parameters, as a means of experimenting with different configurations and seeing how much I can reduce the dimensionality. The idea is based on the observation that the specific parameters you chose for the embedding/encoding also affect the decoding step, so it would be good to have a parametric, unified interface: you can easily forget to update some value in one of the two steps and end up producing incorrect outputs. It also makes it possible to calculate the offset values for the encoding based on said configuration. Hard-coded values like these are incredibly difficult to decipher, and I have found myself making multiple off-by-one errors while trying to modify them.
Another advantage of coding this in one place is that one can write a simple "round trip" to make sure that data is consistently encoded and decoded.

My idea is that every time data is encoded/decoded, a dictionary is created which contains all the relevant factors and offsets. This can be stored with the dataset to ensure consistent encoding and decoding. It might even help with transferring embeddings from one model's code base to another. This is what I came up with so far.

(Please disregard pitch_shift, pitches_per_patch and patch_mapping_offset. They are specific to my use case: choral compositions tend to remain within the limits of the human vocal range, so I could reduce the range of permissible pitches-per-patch to the range between the bass's lowest note (E2) and the soprano's highest (C6), thereby roughly halving the pitches that need to be encoded.)

from pprint import pformat  # needed by the DEBUG printout below


def get_parameters_with_defaults(configuration_parameters, DEBUG=False):
    """
    Get all relevant parameters for the parameter embedding,
    overwriting the default parameters with user-provided configuration values
    """


    # Default parameters. User-passed params overwrite these.
    PARAMS = {
        # unpack parameters
        # the sequence length the model was trained for
        "model_sequence_length": 1024,
        "entries_per_duration": 256,
        # a value that time/duration/velocity are divided by during dataset preprocessing
        # see https://github.com/asigalov61/Allegro-Music-Transformer/issues/12#issuecomment-1923632903
        # octo-velocity refers to this being 8....
        "temporal_downsampling_factor": 8,
        "patch_count": 4,
        "channel_count": 4,
        # number of different values to allow as valid pitches per patch
        "pitches_per_patch": 60,
        # if only a subset of permissible MIDI pitches is to be used, i.e. pitches_per_patch != 128,
        # offset index 0 with this value
        "pitch_shift": 12,
        # value that is added to the patch value
        "patch_mapping_offset": 53,
        # number of velocity values in a MIDI file
        "number_velocities": 8,
    }

    # update default values with user-provided values
    PARAMS.update(**configuration_parameters)

    """
        derived values

        These values are calculated based on the parameters above,
        as they depend on these.
    """
    # Number of values that duration can take.
    entries_per_temporal_val = (
        PARAMS["model_sequence_length"] // PARAMS["temporal_downsampling_factor"]
    )
    # number of different values to allow as valid pitches per patch
    pitch_entries = PARAMS["patch_count"] * PARAMS["pitches_per_patch"]
    duration_velocity_entries = entries_per_temporal_val * PARAMS["number_velocities"]

    max_pitch = PARAMS["pitches_per_patch"] + PARAMS["pitch_shift"]
    # Valid range of pitches to check against
    valid_pitch = range(PARAMS["pitch_shift"], max_pitch)
    # write to dict
    DERIVED_VALUES = {
        "entries_per_temporal_val": entries_per_temporal_val,
        # "entries_per_duration": entries_per_duration,
        "pitch_entries": pitch_entries,
        "duration_velocity_entries": duration_velocity_entries,
        "max_pitch": max_pitch,
    }

    """
        LIMITS 
    """

    # Define the indexes limiting various properties encoded in the INTs.
    # The final value is the embedding dimension
    LIMITS_TIME = range(0, PARAMS['entries_per_duration'])
    LIMITS_DURATION_VELOCITY = range(
        PARAMS['entries_per_duration'], PARAMS['entries_per_duration'] + duration_velocity_entries
    )
    LIMITS_PITCH = range(
        PARAMS['entries_per_duration'] + duration_velocity_entries,
        PARAMS['entries_per_duration'] + duration_velocity_entries + pitch_entries,
    )

    LIMITS = (
        LIMITS_TIME,
        LIMITS_DURATION_VELOCITY,
        LIMITS_PITCH,
    )

    if DEBUG:
        print("####" * 12)
        print("Loaded Embedding parameters from config")
        print("####" * 12)
        print(f"User config: {pformat(configuration_parameters)}")
        print("####" * 12)
        print(f"Resulting config: {pformat(PARAMS)}")
        print("####" * 12)
        print(f"Derived values: {pformat(DERIVED_VALUES)}")
        print("####" * 12)
        string_limits = "\n".join(map(lambda x: f"[{x[0]}, {x[-1]}]", LIMITS))
        print(f"Embedding ranges: {string_limits}")
        print("####" * 12)

    return (PARAMS, DERIVED_VALUES, LIMITS)
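
For illustration, the helper can be called like this (the values here are just placeholders for my choral use case):

user_config = {
    "patch_count": 4,
    "pitches_per_patch": 60,
    "entries_per_duration": 256,
}

PARAMS, DERIVED_VALUES, LIMITS = get_parameters_with_defaults(user_config, DEBUG=True)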

In the process I encountered the following questions:

  1. Does delta_start_time's maximum value depend on the sequence length? I noticed it is set to 512, and I assume this is because the maximum sequence length is 4096 and it is downsampled by a factor of 8... (because of the temporal resolution you described with respect to the velocity). On the other hand, you say that you later reduced this value to 256; does this mean that these are two separate values? I see in the code that generates the INTs that you are dividing the timing by 8, which would seem to indicate that this is either because of the ratio between duration values and overall sequence length, or because of the velocity downsampling. Please elaborate on how dependent these values are on one another.

  2. On a more general level, I don't quite understand how timing is handled. This may be related to my lacking understanding of MIDI more generally, but I don't see how you arrive at the value of 64 tokens per ms. In my understanding, the number of tokens per ms would depend on the speed and length of the composition.
    How is this calculated? ("[....] production versions of GMT use only 256 tokens for delta start times and durations, which is roughly 64 tokens per 1 ms."). I suspect the bar_counter has something to do with this, and that the actual, absolute time of an event within its MIDI file's global time is counted as a combination of bar_time, bar_time_local and abs_time. Could you please elaborate on how this works?

  3. What is the intuition behind chords? I feel like I have understood the first 3 fields (delta_time, dur_vel and pat_ptc), and I have a suspicion what bar_time is used for, but for the remaining part I don't understand what the role of the chords_counter is and why it is compared with a hard-coded value of 50.

Once again, I apologize for asking all these questions. I hope you see that I have invested a lot of time into understanding this, and my hope is that, since you seem interested in sharing your code, my difficulties understanding it might help you develop it in a more accessible format.

@asigalov61
Owner

@mightimatti I apologize for the delayed response...

Thank you for your suggestions about the parameter config. This is indeed how it should be done so that the code is easier to read and use. I will definitely consider that for my future implementations and projects.

To answer your questions:

  1. delta_start_time values/range is not dependent on the seq_len. delta_start_time values/ranges are usually determined by the desired resolution for note timings and also by the model's dictionary size requirements.

In GMT, since I use ms timings, the MIDI is first converted from ticks to ms, and then I further downsample the timing range by dividing the values by some factor (usually 8 or 16), which produces a range of 512 or 256, respectively, for delta_start_times and durations. In other words, I downsample the original MIDI timings (start times and durations) so that they are in a reasonable range while preserving sufficient resolution to avoid damaging the music structure.

For example, in my production versions of GMT, I first converted timings from ticks to ms, then divided them by 16, which gave me a range of timing values from 0 to 255 (256 values) for delta_start_times and durations, which in turn is equivalent to roughly 64 values per second of time (1000 ms), with the max time being 4 s (4000 ms). So if my delta_start_time is 1 second, the value to describe it would be about 64. If the delta_start_time is 4 seconds, the value to describe it would be 255. Same for durations too (see the sketch below).

Hope this makes sense in regard to my timings encoding and conversion.
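
In code, the quantization described above amounts to something like this (a sketch assuming a straight divide-by-16 with clamping; the exact rounding in the production code may differ slightly):

def ms_to_time_token(delta_ms):
    # 256 values total: roughly 64 tokens per second, max ~4 seconds
    return min(delta_ms // 16, 255)

def time_token_to_ms(token):
    # approximate reconstruction of the original timing
    return token * 16

assert time_token_to_ms(ms_to_time_token(1600)) == 1600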

  2. The bar counters are not really used for anything. They were incorporated into GMT to see if they would help with long-term structure and help the model produce more stable output. Unfortunately, adding these to the encoding did not seem to improve the results, so I did not use them in the production versions of GMT, and I suggest that you also do not use them and simply disregard/remove them from the code. In the provided code/Colabs they are mostly present as a demonstration of how it can be done, not for any practical use/value.

From my experience, the MIDI encoding does not matter at all, and it is best to keep it simple and as compact as possible for the best results. The best types of encoding, in my experience, are triplets and quads for each MIDI note (see the sketch below). They do not degrade performance while keeping things simple and efficient.

You can also use asymmetrical encoding to further optimize your implementation. Asymmetrical encoding (such as was used in MuseNet) produces the same results as the symmetrical one while further increasing efficiency, at a minimal increase in training loss.
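
To make the triplet idea concrete, a per-note encoding in that spirit could look something like this (a sketch with illustrative offsets based on the production ranges mentioned above, i.e. 256 delta start times and 256 * 8 combined duration/velocity values; not GMT's exact layout):

TIME_BASE = 0                # 256 delta start time tokens
DUR_VEL_BASE = 256           # 2048 combined duration/velocity tokens
PAT_PTC_BASE = 256 + 2048    # patch/pitch tokens start after the ranges above

def note_to_triplet(delta_t, dur_vel, pat_ptc):
    # each token family lives in its own contiguous range
    return [TIME_BASE + delta_t, DUR_VEL_BASE + dur_vel, PAT_PTC_BASE + pat_ptc]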

Hope this also makes sense, and I am sorry if my implementation is difficult to understand. I am mostly concerned with the end-user experience, so I sometimes drop the ball on the code/implementation itself, which may be important for devs.

Alex.

@mightimatti
Author

Hi Alex,
I'm sorry it's been a while since you posted and I haven't gotten back to you yet. I got sidetracked with some other deadlines at university. I will resume working on this shortly.
Thank you for the insights you provided. As soon as I finish my custom preprocessor, I'll share the results. Out of curiosity: have you experimented with a 1D diffusion net for the purpose of symbolic composition?

@asigalov61
Owner

@mightimatti No worries. I was also very busy with my projects so I totally understand :)

Yes, feel free to PR anything you think may be useful. And I would love to chat more and check out your work too :)

No, I never tried 1D diffusion because I do not think it will work better than autoregressive transformers. But I have wanted to try fine-tuning a diffusion model on symbolic music in the form of images to see if it will work. Why do you ask, though? Do you think 1D diffusion will work on symbolic (MIDI) music?

Alex.

PS. I did some repo clean-up and updates, including tegridy-tools/TMIDIX, so check it out. I think it is much better now.
