* add LICENSE and README.md for review

* add gitattributes to track large files

* add query.bin and truth.bin

* add partial document vectors

* add partial document vectors

* update readme

* update Readme.md

* update the encoding model name

* add query log data

Co-authored-by: Qi Chen <cheqi@microsoft.com>
MaggieQi 2021-11-15 14:27:30 +08:00, committed by GitHub
Parent 0bf0ef5126
Commit ebbc061f5f
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
2 changed files with 5 additions and 2 deletions

@@ -10,7 +10,7 @@ This dataset contains:
* [vectors.bin](vectors.bin): It contains 1,402,020,720 100-dimensional int8-type document descriptors.
* [query.bin](query.bin): It contains 29,316 100-dimensional int8-type query descriptors.
* [truth.bin](truth.bin): It contains the 100 nearest ground-truth neighbors (including vector ids and distances) of 29,316 queries according to L2 distance.
* [query_log.bin](query_log.bin): It contains 94,162 100-dimensional int8-type history query descriptors.
## How to read the vectors, queries, and truth
@@ -44,7 +44,7 @@ ftruth = open('truth.bin', 'rb')
t_count = struct.unpack('i', ftruth.read(4))[0]
topk = struct.unpack('i', ftruth.read(4))[0]
truth_vids = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.int32).reshape((t_count, topk))
-truth_distances = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.float).reshape((t_count, topk))
+truth_distances = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.float32).reshape((t_count, topk))
```
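
The switch from `np.float` to `np.float32` in this hunk matters because `np.float` was only an alias for Python's 64-bit `float` (and has been removed from recent NumPy releases), so a buffer of `t_count * topk * 4` bytes would decode to half as many values and the `reshape` would fail. The following is a minimal sketch of the corrected read, run against a small synthetic buffer rather than the real `truth.bin`; the helper name `read_truth` and the sample values are illustrative assumptions, not part of the repository.

```python
import io
import struct

import numpy as np


def read_truth(f):
    """Read a truth file laid out as in the README snippet:
    int32 query count, int32 topk, then topk int32 ids per query,
    then topk float32 distances per query.
    (Hypothetical helper; the name is not from the repository.)"""
    t_count = struct.unpack('i', f.read(4))[0]
    topk = struct.unpack('i', f.read(4))[0]
    n = t_count * topk
    vids = np.frombuffer(f.read(n * 4), dtype=np.int32).reshape((t_count, topk))
    # float32 is 4 bytes per value, matching the n * 4 bytes read here.
    dists = np.frombuffer(f.read(n * 4), dtype=np.float32).reshape((t_count, topk))
    return vids, dists


# Synthetic stand-in for truth.bin: 2 queries, 3 neighbors each (made-up values).
sample_ids = np.arange(6, dtype=np.int32).reshape(2, 3)
sample_dists = np.array([[0.0, 1.5, 2.0], [0.5, 1.0, 4.0]], dtype=np.float32)
buf = io.BytesIO(struct.pack('ii', 2, 3) + sample_ids.tobytes() + sample_dists.tobytes())

vids, dists = read_truth(buf)
print(vids.shape, dists.dtype)
```

Decoding the same 24-byte distance block as `np.float64` would yield only 3 values, which cannot be reshaped to `(2, 3)`.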
## License

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d4768ea964b595302b08f0b9ef484d7003b2e7848f2f7baf9e2b4bbb3bb36c49
size 9416208
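
As a sanity check, the `size` field of this Git LFS pointer can be compared with the size implied by the README description of `query_log.bin` (94,162 100-dimensional int8 vectors). The sketch below assumes two things that the diff does not state: that this pointer belongs to `query_log.bin` (the file name is not shown here), and that the file starts with two int32 header fields (vector count and dimension), by analogy with the two int32 fields read from `truth.bin`. The helper names are hypothetical.

```python
HEADER_BYTES = 2 * 4  # assumption: int32 vector count + int32 dimension


def expected_size(count, dim, itemsize=1):
    """Expected file size in bytes for `count` vectors of `dim` int8 values
    plus the assumed 8-byte header."""
    return HEADER_BYTES + count * dim * itemsize


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    return dict(line.split(' ', 1) for line in text.strip().splitlines())


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:d4768ea964b595302b08f0b9ef484d7003b2e7848f2f7baf9e2b4bbb3bb36c49
size 9416208
"""

fields = parse_lfs_pointer(pointer_text)
matches = int(fields['size']) == expected_size(94162, 100)
print(matches)
```

Under those assumptions the numbers agree: 94,162 × 100 bytes of int8 data plus 8 header bytes is 9,416,208 bytes.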