Information

Note: If experimenting with code on jupyter notebook, to prevent OOM, run with

jupyter notebook --NotbookApp.iopub_Data_Rate_Limit=1e10

Requirements

Python 3.6+
Pandas

Running of Experiments

python main.py # change block_size to 100/500 in structures.py

Running of Tests

python -m unittest

Relevant Lectures

Lecture 03: Database Systems Storage
- In terms of packing fields into records
  - Since our data is static, we can use fixed format with fixed length
  - Don't need record header
- In terms of packing records into a data block
  - Unspanned method
  - Sequencing should be done for averageRating as all queries are on averageRating (we can just sort the data before inserting into db)
Lecture 07: B+ Tree (1)
Lecture 08: B+ Tree (2)

Data Analysis

Analysis of data.tsv (if interested, look at data_analysis.py and run python data_analysis.py)

First row in file is the columns (tconst, averageRating, numVotes)
Second to last rows are data rows, 1 row == 1 record
Maximum length of tconst is 10
max(averageRating) is 10.0 while min(averageRating) is 1.0
max(numVotes) is 2279223 while min(numVotes) is 5
Total number of records is 1070318
number of decimal places for averageRating < 2
Means we can use fixed format with fixed length for record

Record format

10 bytes for tconst (fixed string of length 10), 4 bytes for averageRating (unsigned int), 4 bytes for numVotes

Total 18 bytes for a record

Block format

Data Block
- Header
  - 1 byte (for denoting if block is data/index_root/index_non_leaf/index_leaf)
  - 4 bytes (for holding block id)
  - 4 bytes (next free offset)
  - 4 bytes (for holding record size)
- Data
  - 18n bytes (records)
  - Each data block can hold at most (block_size - 13) // 18 records
Index Block
- Header
  - 1 byte (for denoting if block is data/index_root/index_non_leaf/index_leaf)
  - 4 bytes (for holding block id)
  - 4 bytes (for holding parent block id)
  - 4 bytes (for holding number of keys currently in index_block)
  - 4 bytes (for holding key size)
  - Key size for our case is 14 as we are using tuple of (averageRating, tconst) as augmented key since there are duplicates for averageRating
- Data
  - 14n bytes for keys + 8(n+1) bytes for pointers (pointer is block_id + offset)
  - Each index block can hold at most (block_size - 25) // 22 keys
    - Derivation:
      - block_size - 17 >= 14n + 8(n+1)
      - block_size - 25 >= 22n
      - (block_size - 25) / 22 >= n

Implementation

Disk is an array of Blocks
Block is a bytearray
Use little endian for numbers (4 bytes per number)
- number.to_bytes(4, byteorder='little', signed=False)
- number = int.from_bytes(byte_array, byteorder='little', signed=False)
- e.g. 1000 <=> [232, 2, 0, 0]
For strings, convert each char to ascii value, pad write with zeros (10 bytes per string)
- e.g. "adke" <=> [115, 100, 97, 101, 0, 0, 0, 0, 0, 0]
Represent each float with the IEEE-754 basic 32-bit binary format
Index blocks have pointers that point to index blocks. All index block pointers point to other index blocks except leaf index blocks which point to data blocks
Data blocks contain records

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
.gitignore		.gitignore
Project 1.pdf		Project 1.pdf
README.md		README.md
data.tsv		data.tsv
data_analysis.py		data_analysis.py
main.py		main.py
structures.py		structures.py
test_disk.py		test_disk.py
test_utils.py		test_utils.py
tracker.py		tracker.py
tree.py		tree.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Information

Note: If experimenting with code on jupyter notebook, to prevent OOM, run with

Requirements

Running of Experiments

Running of Tests

Relevant Lectures

Data Analysis

Record format

Block format

Implementation

About

Releases

Packages

Contributors 3

Languages

nimiew/cz4031

Folders and files

Latest commit

History

Repository files navigation

Information

Note: If experimenting with code on jupyter notebook, to prevent OOM, run with

Requirements

Running of Experiments

Running of Tests

Relevant Lectures

Data Analysis

Record format

Block format

Implementation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages