2019 杏盛 Academic Frontier Lecture Series (No. 46)
Speaker: Professor Wang Jun (王軍), University of Central Florida, USA
Title: Enabling High-performance Sampling for Big Data Processing
Abstract:
In this talk, we demonstrate how to perform sampling on today's big data processing platforms, enabling approximations on arbitrary sub-datasets of a large dataset that are both efficient and accurate. Because caching offline samples for every sub-dataset incurs prohibitive storage overhead, existing offline-sample-based systems provide high-accuracy results for only a limited number of sub-datasets, such as the popular ones. Current online-sample-based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well when a sub-dataset is uniformly distributed but suffer from low sampling efficiency and poor estimation accuracy on unevenly distributed sub-datasets.
To address this problem, we develop a distribution-aware method called Sapprox. Our idea is to collect the number of occurrences of a sub-dataset in each logical partition of a dataset (its storage distribution) in the distributed system, and to use this information to facilitate online sampling. We have implemented Sapprox in the Hadoop ecosystem as an example system and open-sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup of up to 20x over precise execution.
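The distribution-aware idea can be illustrated with a minimal sketch. This is not the Sapprox implementation: it assumes a sub-dataset is identified by a key, that partitions are in-memory lists of (key, value) records, and that partitions are drawn with probability proportional to their occurrence counts and combined with a standard Hansen-Hurwitz estimator. The names occurrence_counts and estimate_sum are hypothetical.

```python
import random


def occurrence_counts(partitions, key):
    """Count how many records of `key` each partition holds (the storage distribution)."""
    return [sum(1 for k, _ in part if k == key) for part in partitions]


def estimate_sum(partitions, counts, key, n_draws, rng):
    """Estimate the total of `key`'s values by sampling partitions in proportion
    to their occurrence counts (with replacement).

    Illustrative assumption: a Hansen-Hurwitz estimator, i.e. the mean of
    (partition sum / selection probability) over the drawn partitions.
    """
    total = sum(counts)
    if total == 0:
        return 0.0  # the sub-dataset does not occur in any partition
    probs = [c / total for c in counts]
    # Partitions with a zero count have zero weight and are never drawn.
    drawn = rng.choices(range(len(partitions)), weights=counts, k=n_draws)
    acc = 0.0
    for i in drawn:
        part_sum = sum(v for k, v in partitions[i] if k == key)
        acc += part_sum / probs[i]
    return acc / n_draws


if __name__ == "__main__":
    # Sub-dataset "a" is unevenly spread: 50 records, none, and 2 records.
    parts = [
        [("a", 1.0)] * 50 + [("b", 1.0)] * 5,
        [("b", 1.0)] * 10,
        [("a", 1.0)] * 2,
    ]
    counts = occurrence_counts(parts, "a")
    print(counts, estimate_sum(parts, counts, "a", n_draws=4, rng=random.Random(0)))
```

Because partitions that do not contain the sub-dataset are never read, the sampling effort goes only where the data is. Uniform partition sampling would spend part of its budget on such empty partitions.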
Time and venue: October 26, 2019, Room 235, Information Building (信息235室)
Host: College of Information Engineering
杏盛娱乐
2019.10.17