MindSpore 25-Day Study Camp, Day 25 | MusicGen

Published: 2024-07-27

MusicGen is a music generation model based on a language model (LM).

Music can be generated from a text prompt or an audio prompt.

MusicGen is built on the Transformer architecture:

text → text encoder → hidden representation → decoder → music

from mindnlp.transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained('facebook/musicgen-small')

There are two decoding modes: greedy and sampling. Sampling is reported to give noticeably better results than greedy decoding.
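The difference between the two modes can be sketched on a toy logits vector (plain numpy; this is purely illustrative and not MusicGen's internal code):

```python
import numpy as np

# Toy next-token scores over a vocabulary of 5 audio codes.
logits = np.array([2.0, 1.0, 0.5, 0.2, 0.1])

# Greedy decoding always picks the highest-scoring token,
# so every run produces the same sequence.
greedy_token = int(np.argmax(logits))

# Sampling draws from the softmax distribution, so lower-scoring
# tokens can still be chosen, which gives more varied music.
probs = np.exp(logits) / np.exp(logits).sum()
rng = np.random.default_rng(0)
sampled_token = int(rng.choice(len(probs), p=probs))
```

In MusicGen this choice is controlled by the `do_sample` flag passed to `model.generate`.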

With no prompt

%%time
import scipy.io.wavfile
from IPython.display import Audio

unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write('musicgen_out.wav', rate=sampling_rate, data=audio_values[0, 0].asnumpy())
Audio(audio_values[0].asnumpy(), rate=sampling_rate)  # play the result inside Jupyter

audio_length_in_s = 256/model.config.audio_encoder.frame_rate
audio_length_in_s
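As a sanity check on that formula: the codec frame rate for facebook/musicgen-small is assumed here to be 50 Hz (the code above reads the real value from `model.config.audio_encoder.frame_rate`), so 256 generated tokens correspond to about 5 seconds of audio:

```python
# Assumed frame rate (codec frames per second); the snippet above
# reads the actual value from model.config.audio_encoder.frame_rate.
frame_rate = 50

max_new_tokens = 256
audio_length_in_s = max_new_tokens / frame_rate  # duration of the generated clip
```

Raising `max_new_tokens` lengthens the clip proportionally, at the cost of longer generation time.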

With text prompt

from mindnlp.transformers import AutoProcessor

processor = AutoProcessor.from_pretrained('facebook/musicgen-small')
inputs = processor(
    text=['90s pop track with bassy drums and synth', '80s rock song with loud guitars and heavy drums'],
    padding=True,
    return_tensors='ms'
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens = 256)
scipy.io.wavfile.write('musicgen_out_text.wav', rate = sampling_rate, data = audio_values[0,0].asnumpy())
Audio(audio_values[0].asnumpy(), rate = sampling_rate)

With audio prompt

from datasets import load_dataset

processor = AutoProcessor.from_pretrained('facebook/musicgen-small')
dataset = load_dataset('sanchit-gandhi/gtzan', split='train', streaming=True)
sample = next(iter(dataset))['audio']
# Use only the first half of the clip as the audio prompt.
sample['array'] = sample['array'][:len(sample['array']) // 2]
inputs = processor(
    audio=sample['array'],
    sampling_rate=sample['sampling_rate'],
    text=['80s blues track with groovy saxophone'],
    padding=True,
    return_tensors='ms'
)
audio_values = model.generate(**inputs, do_sample = True, guidance_scale= 3, max_new_tokens = 256)

Batch generation

sample = next(iter(dataset))['audio']
sample_1 = sample['array'][:len(sample['array']) // 4]  # shorter prompt
sample_2 = sample['array'][:len(sample['array']) // 2]  # longer prompt
inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample['sampling_rate'],
    text=['80s blues track with groovy saxophone', '90s rock song with loud guitars and heavy drums'],
    padding=True,
    return_tensors='ms'
)
audio_values = model.generate(**inputs, do_sample = True, guidance_scale = 3, max_new_tokens = 256)

audio_values = processor.batch_decode(audio_values, padding_mask=inputs.padding_mask)  # trim each clip back to its true length
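What the padding mask is for can be illustrated with plain numpy (a simplified sketch, not the library's actual implementation): since the two audio prompts have different lengths, the batch is padded to the longest clip, and the mask marks which samples are real so the padding can be cut back off:

```python
import numpy as np

# Two waveforms padded to the same length (trailing zeros = padding).
batch = np.array([
    [0.1, 0.2, 0.3, 0.0, 0.0],
    [0.4, 0.5, 0.6, 0.7, 0.8],
])
# 1 marks real samples, 0 marks padding.
padding_mask = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
])

# Trim each waveform back to its unpadded length.
trimmed = [wav[mask.astype(bool)] for wav, mask in zip(batch, padding_mask)]
```

After trimming, each element of the batch keeps only its own valid samples, which is why `batch_decode` returns clips of different lengths.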

    

