First of all, I want to thank my teammate Wu Jiaqi, he has contributed a great deal to this project.
- language: java
- algorithm: TPMMS
Sort main memory-sized partitions to produce the sorted sublists.
- compute how many tuples a block can hold.
- compute how many blocks main memory can hold (B)
- compute how many times(n) these data should be sort in memory and devide the file to n sublists.
- read every sublist to memory to sort by quick sort.
- write the sorted sublists to a new file in disk.
Deduplicating and Merging data
- Put all sublists into memory and compute how many tuples a sublists should hold.
- compute the smallest element s and its sublist l from the first element of these sublists.
- check whether the last element of the buffer list (A) equal this element or not:
if equal:
compute the newest date n then substitude the last element as the newest data.
#A[-1] = n
else:
append this element to buffer list.
#A.append(s) - pop the first element from l and append a new element from disk.
l.pop(0)
l.append(new_element) - if the buffer list (A) is full, save it in disk and creat a new empty buffer list until all sublists have been sorted.