# ZAP File Format ## Legend ### Sections |========| | | section |========| ### Fixed-size fields |--------| |----| |--| |-| | | uint64 | | uint32 | | uint16 | | uint8 |--------| |----| |--| |-| ### Varints |~~~~~~~~| | | varint(up to uint64) |~~~~~~~~| ### Arbitrary-length fields |--------...---| | | arbitrary-length field (string, vellum, roaring bitmap) |--------...---| ### Chunked data [--------] [ ] [--------] ## Overview Footer section describes the configuration of particular ZAP file. The format of footer is version-dependent, so it is necessary to check `V` field before the parsing. |==================================================| | Stored Fields | |==================================================| |-----> | Stored Fields Index | | |==================================================| | | Dictionaries + Postings + DocValues | | |==================================================| | |---> | DocValues Index | | | |==================================================| | | | Fields | | | |==================================================| | | |-> | Fields Index | | | | |========|========|========|========|====|====|====| | | | | D# | SF | F | FDV | CF | V | CC | (Footer) | | | |========|====|===|====|===|====|===|====|====|====| | | | | | | |-+-+-----------------| | | | |--------------------------| | |-------------------------------------| D#. Number of Docs. SF. Stored Fields Index Offset. F. Field Index Offset. FDV. Field DocValue Offset. CF. Chunk Factor. V. Version. CC. CRC32. ## Stored Fields Stored Fields Index is `D#` consecutive 64-bit unsigned integers - offsets, where relevant Stored Fields Data records are located. 0 [SF] [SF + D# * 8] | Stored Fields | Stored Fields Index | |================================|==================================| | | | | |--------------------| ||--------|--------|. . .|--------|| | |-> | Stored Fields Data | || 0 | 1 | | D# - 1 || | | |--------------------| ||--------|----|---|. . .|--------|| | | | | | |===|============================|==============|===================| | | |-------------------------------------------| Stored Fields Data is an arbitrary size record, which consists of metadata and [Snappy](https://github.com/golang/snappy)-compressed data. Stored Fields Data |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~| | MDS | CDS | MD | CD | |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~| MDS. Metadata size. CDS. Compressed data size. MD. Metadata. CD. Snappy-compressed data. ## Fields Fields Index section located between addresses `F` and `len(file) - len(footer)` and consist of `uint64` values (`F1`, `F2`, ...) which are offsets to records in Fields section. We have `F# = (len(file) - len(footer) - F) / sizeof(uint64)` fields. (...) [F] [F + F#] | Fields | Fields Index. | |================================|================================| | | | | |~~~~~~~~|~~~~~~~~|---...---|||--------|--------|...|--------|| ||->| Dict | Length | Name ||| 0 | 1 | | F# - 1 || || |~~~~~~~~|~~~~~~~~|---...---|||--------|----|---|...|--------|| || | | | ||===============================|==============|=================| | | |----------------------------------------------| ## Dictionaries + Postings Each of fields has its own dictionary, encoded in [Vellum](https://github.com/couchbase/vellum) format. Dictionary consists of pairs `(term, offset)`, where `offset` indicates the position of postings (list of documents) for this particular term. |================================================================|- Dictionaries + | | Postings + | | DocValues | Freq/Norm (chunked) | | [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] | | |->[ Freq | Norm (float32 under varint) ] | | | [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] | | | | | |------------------------------------------------------------| | | Location Details (chunked) | | | [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~] | | | |->[ Size | Pos | Start | End | Arr# | ArrPos | ... ] | | | | [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~] | | | | | | | |----------------------| | | | Postings List | | | | |~~~~~~~~|~~~~~|~~|~~~~~~~~|-----------...--| | | | |->| F/N | LD | Length | ROARING BITMAP | | | | | |~~~~~|~~|~~~~~~~~|~~~~~~~~|-----------...--| | | | | |----------------------------------------------| | | |--------------------------------------| | | Dictionary | | | |~~~~~~~~|--------------------------|-...-| | | |->| Length | VELLUM DATA : (TERM -> OFFSET) | | | | |~~~~~~~~|----------------------------...-| | | | | |======|=========================================================|- DocValues Index | | | |======|=========================================================|- Fields | | | | |~~~~|~~~|~~~~~~~~|---...---| | | | Dict | Length | Name | | | |~~~~~~~~|~~~~~~~~|---...---| | | | |================================================================| ## DocValues DocValues Index is `F#` pairs of varints, one pair per field. Each pair of varints indicates start and end point of DocValues slice. |================================================================| | |------...--| | | |->| DocValues |<-| | | | |------...--| | | |==|=================|===========================================|- DocValues Index ||~|~~~~~~~~~|~~~~~~~|~~| |~~~~~~~~~~~~~~|~~~~~~~~~~~~|| || DV1 START | DV1 STOP | . . . . . | DV(F#) START | DV(F#) END || ||~~~~~~~~~~~|~~~~~~~~~~| |~~~~~~~~~~~~~~|~~~~~~~~~~~~|| |================================================================| DocValues is chunked Snappy-compressed values for each document and field. [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-] [ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ] [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-] Last 16 bytes are description of chunks. |~~~~~~~~~~~~...~|----------------|----------------| | Chunk Sizes | Chunk Size Arr | Chunk# | |~~~~~~~~~~~~...~|----------------|----------------|