Dataset opensource (#247)
* add LICENSE and README.md for review
* add gitattributes to track large files
* add query.bin and truth.bin
* add partial document vectors
* add partial document vectors
* update readme
* update Readme.md
* update the encoding model name
* add query log data

Co-authored-by: Qi Chen <cheqi@microsoft.com>
Parent: 0bf0ef5126
Commit: ebbc061f5f
@@ -10,7 +10,7 @@ This dataset contains:
* [vectors.bin](vectors.bin): It contains 1,402,020,720 100-dimensional int8-type document descriptors.
* [query.bin](query.bin): It contains 29,316 100-dimensional int8-type query descriptors.
* [truth.bin](truth.bin): It contains the ground-truth 100 nearest neighbors (vector ids and distances) of the 29,316 queries according to L2 distance.
* [query_log.bin](query_log.bin): It contains 94,162 100-dimensional int8-type historical query descriptors.
## How to read the vectors, queries, and truth
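The diff below only shows the ground-truth part of the reading example. As a rough illustration, here is a minimal sketch of reading query.bin into a NumPy array, assuming the descriptor files start with two int32 headers (vector count and dimension) followed by row-major int8 values, i.e. the same header-then-data convention used by the truth.bin snippet in the next hunk; the file path and variable names are placeholders only.

```
import struct

import numpy as np

# Assumed layout: [int32 count][int32 dimension][count * dimension int8 values]
with open('query.bin', 'rb') as fquery:
    q_count = struct.unpack('i', fquery.read(4))[0]   # number of query descriptors
    q_dim = struct.unpack('i', fquery.read(4))[0]     # descriptor dimensionality (expected: 100)
    queries = np.frombuffer(fquery.read(q_count * q_dim), dtype=np.int8).reshape((q_count, q_dim))

print(queries.shape)  # expected: (29316, 100)
```

vectors.bin presumably follows the same layout, but with 1,402,020,720 rows it is better read in chunks or memory-mapped (e.g. np.memmap with an offset past the 8-byte header) than loaded in a single call.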
@@ -44,7 +44,7 @@ ftruth = open('truth.bin', 'rb')
t_count = struct.unpack('i', ftruth.read(4))[0]
topk = struct.unpack('i', ftruth.read(4))[0]
truth_vids = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.int32).reshape((t_count, topk))
-truth_distances = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.float).reshape((t_count, topk))
+truth_distances = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.float32).reshape((t_count, topk))
```
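Evaluating an index against this ground truth typically means computing recall. The sketch below is a minimal, hedged example of recall@k, assuming a hypothetical `approx_vids` array of retrieved ids (one row per query, in the same order as the queries); neither that array nor the helper name comes from the dataset.

```
# Hedged usage sketch: recall@k of hypothetical ANN results against the ground truth.
# `approx_vids` is assumed to be an (n_queries, >=k) array of retrieved vector ids,
# aligned row-for-row with `truth_vids` loaded above.
def recall_at_k(approx_vids, truth_vids, k):
    hits = 0
    for approx_row, truth_row in zip(approx_vids, truth_vids):
        # count how many of the true top-k ids appear among the retrieved top-k candidates
        hits += len(set(approx_row[:k]) & set(truth_row[:k]))
    return hits / (len(truth_vids) * k)

# e.g. recall_at_k(my_results, truth_vids, k=10)   # `my_results` is a placeholder
```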
## License
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d4768ea964b595302b08f0b9ef484d7003b2e7848f2f7baf9e2b4bbb3bb36c49
size 9416208