yahoo streaming benchmark on briskstream ? #5
Comments
Hi, Currently, no. You are very welcome to implement it on BriskStream. The API is almost identical to Storm's (except for some syntactic sugar), so it should be straightforward if the application is already implemented in Storm. Tony.
Dear Shuhao, When I ran the WordCount app with profiling, following issue #4,
I encountered the following problem. It indicates that I didn't prepare a dataset for WordCount. I need to understand how it works and then implement YSB.
Hi, For WordCount, the input source I used is the Linux dictionary, normally at /usr/share/dict/words. Besides, if you just need to run the program, you can pass --native to run it without any profiling or optimization involved. Tony
Hi,
Hi, Sure, please also remember to configure your machine specification accordingly, as done in ./briskstream/common/src/main/java/applications/Platform.java. Then, specify which machine you are using by passing the "--machine" argument. Thanks! Tony
Hi Tony, For details, you can also view my source code. **Does the throughput make sense?** Thanks in advance.
Hi Zongxiong, The reported throughput of each operator (specifically, each executor) is the number of function invocations divided by the total duration. This is mainly for debugging purposes. The reported throughput of different operators may vary significantly because of the queues between them. So, in your case, the Spout is running at a much higher speed, and its output accumulates in its output queue. Regards,
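To make the measurement described above concrete, here is a minimal sketch of "invocations divided by duration". `ThroughputMeter` and its methods are hypothetical names for illustration only and are not part of BriskStream; the numbers in `main` are made up to show how a fast spout and a slow sink can report very different rates over the same interval.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a BriskStream class): per-operator throughput as invocations / duration. */
public class ThroughputMeter {
    private long invocations = 0;
    private final long startNanos;

    public ThroughputMeter(long startNanos) {
        this.startNanos = startNanos;
    }

    /** Call once per function invocation, e.g. once per processed tuple. */
    public void record() {
        invocations++;
    }

    /** Invocations per second between startNanos and nowNanos; 0 if no time has elapsed. */
    public double perSecond(long nowNanos) {
        long elapsed = nowNanos - startNanos;
        if (elapsed <= 0) {
            return 0.0;
        }
        return invocations * 1e9 / elapsed;
    }

    public static void main(String[] args) {
        // Made-up counts: the spout fires far more often than the sink in the same 2 seconds.
        ThroughputMeter spout = new ThroughputMeter(0L);
        ThroughputMeter sink = new ThroughputMeter(0L);
        for (int i = 0; i < 1000; i++) spout.record();
        for (int i = 0; i < 10; i++) sink.record();

        long twoSeconds = TimeUnit.SECONDS.toNanos(2);
        System.out.println("spout invocations/s: " + spout.perSecond(twoSeconds));
        System.out.println("sink invocations/s:  " + sink.perSecond(twoSeconds));
    }
}
```

Because each operator is measured independently, a gap between the two numbers says nothing about end-to-end progress by itself; tuples emitted by the faster operator may simply be waiting in a queue.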
Hey Tony, this is a follow-up to the questions asked by @chenzongxiong (I've used some parts of his logic to implement my solution). I am also trying to implement the Yahoo Streaming Benchmark with BriskStream and make a fair comparison with my system. These are the parameters I pass to the BriskStreamRunner (I've configured my machine details: 2 sockets with 8 cores per socket and 64 GB RAM):
In queries that have some aggregation logic, the throughput can only be measured at the source, because the throughput of the sinks is significantly lower as a result of the aggregation. By reading previous issues and playing with the code, I've seen that there are no window semantics, so I am emulating count-based windows (the logic is still incorrect, but let's assume it does the job). For simplicity, I have a spout that replays data, followed by a bolt that does all the processing, and finally a sink that just receives data. This is what the core logic looks like. I would be grateful if you could suggest what I am doing wrong here.
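For readers following the thread, the idea of emulating a count-based (tumbling) window inside a single processing step can be sketched as below. This is an illustrative, framework-free example and not the poster's code: `CountWindowCounter`, its `process` method, and the string keys (standing in for something like YSB campaign IDs) are all assumptions, and no BriskStream or Storm API is used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: per-key counts over tumbling windows of a fixed number of tuples. */
public class CountWindowCounter {
    private final int windowSize;
    private int seen = 0;
    private Map<String, Long> counts = new HashMap<>();

    public CountWindowCounter(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /**
     * Adds one tuple to the current window. Returns the per-key counts when the
     * window fills up (and starts a fresh window), otherwise an empty Optional.
     */
    public Optional<Map<String, Long>> process(String key) {
        counts.merge(key, 1L, Long::sum);
        seen++;
        if (seen < windowSize) {
            return Optional.empty();
        }
        Map<String, Long> closed = counts;
        counts = new HashMap<>(); // reset state so windows do not overlap
        seen = 0;
        return Optional.of(closed);
    }

    public static void main(String[] args) {
        CountWindowCounter bolt = new CountWindowCounter(4);
        String[] input = {"a", "b", "a", "a", "b", "b", "c", "b"};
        for (String key : input) {
            // Only every 4th tuple produces output, which is why a downstream
            // sink sees far fewer tuples than the source emits.
            bolt.process(key).ifPresent(w -> System.out.println("window closed: " + w));
        }
    }
}
```

Emitting once per window rather than once per tuple is also what makes sink-side throughput look much lower than source-side throughput in aggregation queries.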
Cheers,
Hi George, I have three comments. First, I still suggest to measure performance by Second, when Third, at its current stage, BriskStream assumes that each operator has a constant workload for each input tuple. This does not hold universally. That's all I have in mind now. Thanks! Tony.
Hi Tony, Thanks for the prompt response and help. You are right about the tumbling window representation. I will add the code/logic from Storm for handling them properly, but this was just the first step. However, the batching did indeed increase the performance, along with the removal of some redundant memory copies! Regarding the throughput, in my mind, it is the number of records ingested and processed per unit of time (measured from the source). Cheers,
Is there any implementation of YSB on BriskStream?
Or are there any hints for me on how to implement a new benchmark?