Dataset opensource (#247)
* add LICENSE and README.md for review
* add gitattributes to track large files
* add query.bin and truth.bin
* add partial document vectors
* add partial document vectors
* update readme
* update Readme.md
* update the encoding model name
* add query log data

Co-authored-by: Qi Chen <cheqi@microsoft.com>
Parent: 0bf0ef5126
Commit: ebbc061f5f
@@ -10,7 +10,7 @@ This dataset contains:
* [vectors.bin](vectors.bin): It contains 1,402,020,720 100-dimensional int8-type document descriptors.
* [query.bin](query.bin): It contains 29,316 100-dimensional int8-type query descriptors.
* [truth.bin](truth.bin): It contains the ground-truth 100 nearest neighbors (vector ids and distances) of the 29,316 queries according to L2 distance.
* [query_log.bin](query_log.bin): It contains 94,162 100-dimensional int8-type historical query descriptors.
## How to read the vectors, queries, and truth
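The diff below only shows the ground-truth part of the reading example. As a rough illustration, here is a minimal sketch of reading query.bin into a NumPy array, assuming the descriptor files start with two int32 headers (vector count and dimension) followed by row-major int8 values, i.e. the same header-then-data convention used by the truth.bin snippet in the next hunk; the file path and variable names are placeholders only.

```
import struct

import numpy as np

# Assumed layout: [int32 count][int32 dimension][count * dimension int8 values]
with open('query.bin', 'rb') as fquery:
    q_count = struct.unpack('i', fquery.read(4))[0]   # number of query descriptors
    q_dim = struct.unpack('i', fquery.read(4))[0]     # descriptor dimensionality (expected: 100)
    queries = np.frombuffer(fquery.read(q_count * q_dim), dtype=np.int8).reshape((q_count, q_dim))

print(queries.shape)  # expected: (29316, 100)
```

vectors.bin presumably follows the same layout, but with 1,402,020,720 rows it is better read in chunks or memory-mapped (e.g. np.memmap with an offset past the 8-byte header) than loaded in a single call.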
@@ -44,7 +44,7 @@ ftruth = open('truth.bin', 'rb')
t_count = struct.unpack('i', ftruth.read(4))[0]
topk = struct.unpack('i', ftruth.read(4))[0]
truth_vids = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.int32).reshape((t_count, topk))
-truth_distances = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.float).reshape((t_count, topk))
+truth_distances = np.frombuffer(ftruth.read(t_count * topk * 4), dtype=np.float32).reshape((t_count, topk))
```
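Evaluating an index against this ground truth typically means computing recall. The sketch below is a minimal, hedged example of recall@k, assuming a hypothetical `approx_vids` array of retrieved ids (one row per query, in the same order as the queries); neither that array nor the helper name comes from the dataset.

```
# Hedged usage sketch: recall@k of hypothetical ANN results against the ground truth.
# `approx_vids` is assumed to be an (n_queries, >=k) array of retrieved vector ids,
# aligned row-for-row with `truth_vids` loaded above.
def recall_at_k(approx_vids, truth_vids, k):
    hits = 0
    for approx_row, truth_row in zip(approx_vids, truth_vids):
        # count how many of the true top-k ids appear among the retrieved top-k candidates
        hits += len(set(approx_row[:k]) & set(truth_row[:k]))
    return hits / (len(truth_vids) * k)

# e.g. recall_at_k(my_results, truth_vids, k=10)   # `my_results` is a placeholder
```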
## License
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d4768ea964b595302b08f0b9ef484d7003b2e7848f2f7baf9e2b4bbb3bb36c49
size 9416208