Sander Wood 2023-04-23 01:02:56 +08:00 committed by GitHub
Parent c70065ea42
Commit 5de72382f3
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
1 changed file with 1 addition and 1 deletion


@@ -4,7 +4,7 @@ The intellectual property of the CLaMP project is owned by the [Central Conserva
In [CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval](https://ai-muzic.github.io/clamp/), we introduce a solution for cross-modal symbolic MIR that utilizes contrastive learning and pre-training. The proposed approach, CLaMP (Contrastive Language-Music Pre-training), learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss. To pre-train CLaMP, we collected a large dataset of 1.4 million music-text pairs. Pre-training employs text dropout as a data augmentation technique and bar patching to efficiently represent music data, reducing sequence length to less than 10% of its original length. In addition, we developed a masked music model pre-training objective to enhance the music encoder's comprehension of musical context and structure. CLaMP integrates textual information to enable semantic search and zero-shot classification for symbolic music, surpassing the capabilities of previous models. To support the evaluation of semantic search and music classification, we publicly release [WikiMusicText](https://huggingface.co/datasets/sander-wood/wikimusictext) (WikiMT), a dataset of 1010 lead sheets in ABC notation, each accompanied by a title, artist, genre, and description. In comparison to state-of-the-art models that require fine-tuning, zero-shot CLaMP demonstrated comparable or superior performance on score-oriented datasets.
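
The joint training of the two encoders can be illustrated with a short, CLIP-style sketch. This is a minimal illustrative implementation, not CLaMP's released training code; the symmetric cross-entropy form, the batch-diagonal targets, and the `temperature` value are assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(music_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric contrastive loss for a batch of N music-text pairs."""
    # Normalize so the dot product becomes cosine similarity.
    music_emb = F.normalize(music_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # N x N similarity logits; matched pairs lie on the diagonal.
    logits = music_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the music-to-text and text-to-music cross-entropy terms.
    loss_m2t = F.cross_entropy(logits, targets)
    loss_t2m = F.cross_entropy(logits.t(), targets)
    return (loss_m2t + loss_t2m) / 2
```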
- <p align="center"><img src="../img/clamp_clamp.png" width="400"><br/>The architecture of CLaMP, including two encoders - one for music and one for text - trained jointly with a contrastive loss to learn cross-modal representations.</p>
+ <p align="center"><img src="../img/clamp_clamp.png" width="500"><br/>The architecture of CLaMP, including two encoders - one for music and one for text - trained jointly with a contrastive loss to learn cross-modal representations.</p>
Two variants of CLaMP are introduced: [CLaMP-S/512](https://huggingface.co/sander-wood/clamp-small-512) and [CLaMP-S/1024](https://huggingface.co/sander-wood/clamp-small-1024). Both models consist of a 6-layer music encoder and a 6-layer text encoder with a hidden size of 768. While CLaMP-S/512 accepts input music sequences of up to 512 tokens in length, CLaMP-S/1024 allows for up to 1024 tokens. The maximum input length for the text encoder in both models is 128 tokens. These models are part of [Muzic](https://github.com/microsoft/muzic), a research initiative on AI music that leverages deep learning and artificial intelligence to enhance music comprehension and generation.
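
As a minimal getting-started sketch, both checkpoints can be fetched from the Hugging Face Hub with `huggingface_hub`; this only downloads the weight files, since the model definitions and inference code live in the Muzic repository:

```python
from huggingface_hub import snapshot_download

# Download the CLaMP-S/512 checkpoint; swap in "sander-wood/clamp-small-1024"
# for the 1024-token variant. Files land in the local Hugging Face cache.
local_dir = snapshot_download(repo_id="sander-wood/clamp-small-512")
print("Checkpoint files downloaded to:", local_dir)
```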