Please cite this article in press as: Torrecilla J.L., Romo J., Data learning from big data. Statistics and Probability Letters (2018), https://doi.org/10.1016/j.spl.2018.02.038.

Contents lists available at ScienceDirect
Statistics and Probability Letters
journal homepage: www.elsevier.com/locate/stapro

Data learning from big data

José L. Torrecilla a,*, Juan Romo b
a Institute UC3M-BS of Financial Big Data, Universidad Carlos III de Madrid, Spain
b Department of Statistics, Universidad Carlos III de Madrid, Spain

Article info
Article history: Available online xxxx
MSC: 00-01, 99-00
Keywords: Big data; Data learning; Statistics

Abstract
Technology is generating a huge and growing availability of observations of diverse nature. This big data is placing data learning as a central scientific discipline. It includes collection, storage, preprocessing, visualization and, essentially, statistical analysis of enormous batches of data. In this paper, we discuss the role of statistics regarding some of the issues raised by big data in this new paradigm and also propose the name of data learning to describe all the activities that make it possible to obtain relevant knowledge from this new source of information. © 2018 Elsevier B.V. All rights reserved.

1. Introduction
Big data is one of the most fashionable concepts nowadays: everybody talks about it, it is permanently in the media, and companies and governments try to exploit the newly available information (Lohr, 2012; John Walker, 2014; James, 2018). Two main ideas lie behind this interest. First, at present most activities generate data, at very low cost, that contain potentially valuable information. The second one
is well summarized in John Walker (2014): “Data-driven decisions are better decisions - it is as simple as that. Using big data enables managers to decide on the basis of evidence rather than intuition”. The opportunities offered by big data are undeniable, but there is still debate about its scope and usefulness (Secchi, 2018; Bühlmann and van de Geer, 2018). The most fervent followers speak of the end
of theory and models and, in articles like the controversial “The end of theory” (Anderson, 2008), they argue that “with enough data, the numbers speak for themselves”. On the other hand, more critical voices question whether the optimism and faith being placed in big data are really justified. Along these lines, Tim Harford wonders if “we are making a mistake” in another provocative article
(Harford, 2014). In this paper we review some aspects of big data that can raise doubts from the point of view of a statistician trying to scrutinize whether the data are sufficient by themselves or whether it is necessary to give them meaning.

First, it is convenient to be more specific. Although there is no single definition, there seems to be a certain consensus that big data encompasses the study of problems so “Big” that conventional tools and models cannot handle them, either because they are not adequate or because they require too much time. In any case, whatever definition we choose and wherever we put the emphasis, it is clear that current technology generates huge amounts of data, so we have to be able to extract the best information from them and use it to make the best decisions. How to do so, and the challenges associated with this new framework, have become a common topic of discussion in recent years (Lynch, 2008; Fan et al., 2014; Gandomi and Haider, 2015), and the best way to tackle the problem has also been a subject of debate. As an example, we can cite the earlier paper by Breiman et al. (2001) about the two cultures of statistical modeling: stochastic models and algorithms (see Dunson, 2018 for a recent discussion in the context of big data). In what follows, we discuss the role of statistics regarding some of the [...]

* Corresponding author.
E-mail address: joseluis.torrecilla@uc3m.es (J.L. Torrecilla).
https://doi.org/10.1016/j.spl.2018.02.038
0167-7152/© 2018 Elsevier B.V. All rights reserved.

[...] (2014) uses random forest in a distributed framework for variable selection, an evolutionary algorithm based on MapReduce is proposed in Peralta et al. (2015) for big data classification, and Bolón-Canedo et al. (2015) provide a survey of some feature selection algorithms for big data problems ranging from DNA microarray analysis to face recognition. Furthermore, parallel penalized coordinate descent methods such as
those used for lasso optimization and others are studied in Richtárik and Takáč (2016), and fast versions of other traditional algorithms have been proposed, such as a version of PCA based on randomization (Abraham and Inouye, 2014). Finally, reduction methods applied to more complex structures can also be found, for example subgraphs (Pan et al., 2015).

An important example of big data with high dimensional observations is functional data. Functional Data Analysis (FDA) deals with objects of infinite dimension, and the problems related to this functional nature are sometimes similar to those associated with big data (see, e.g., Ahmed, 2017; Goia and Vieu, 2016). In particular, a very relevant question in functional data analysis is dimension reduction (see Vieu, 2018).

Despite the number of works on these topics, there is still room for research. Most of these proposals just apply parallelization tools (such as those mentioned before), stochastic search or on-line learning techniques aimed at accelerating existing algorithms. This computational efficiency is necessary, and it is worth exploring new search strategies, matrix processing techniques, and so on. In fact, the mere parallelization of existing methods is not trivial, since some very basic functionalities are not easily parallelizable (see,
e.g., the median: the median of a sample is not, in general, the median of the medians of its parts). Beyond that, other aspects can also be considered in order to answer natural questions like: What are the relevant variables in this heterogeneous context? How can we find and compare them? How can we deal in a non-trivial way with unstructured information?

These and other problems are largely motivated by the variability and characteristics of the observations in big data problems (not just the dimension). Here, there is no longer a set of homogeneous measures, or even numeric items. The new variables can come from very different contexts and we do not even know how to code them adequately to extract
the useful information. We must explore new ways to develop effective indicators, starting, for example, from opinions in social networks or from telephone calls to understand problems such as the level of satisfaction or recommendation systems. The complexity of some structures means that in some cases we lack even basic descriptive statistics and so it is necessary to propose new metrics or ways to establish relationships in order to
be able to measure and compare (Marron and Alonso, 2014). Finally, it will also be important to study the properties of new methods and measures and to develop new visualization tools (Keim et al., 2013).

4. Big samples

We have been dealing with the dimension problem, even with certain levels of heterogeneity, for many years and we have some experience in facing it. Nevertheless, big data has brought a series of new problems related to the
very high number (and often, low quality) of observations that were not expected a few years ago. By contrast, from a classical point of view, having a lot of data was always considered something positive that would make the models converge and practically allow us to achieve population results. We have often read and said phrases like “the problem is that we do not have enough data” and explained to our students how having enough data makes everything work. Well, now we have a lot of data and not everything works; in fact, what was promised as a blessing is rather a kind of curse with unexpected consequences that open new lines of research. Below we briefly discuss some of the problems that have arisen due to the huge size, and sometimes low quality, of the samples.

Heterogeneous sources. In big data applications, it is no longer common for observations
to come from a single source with a single manageable coding. Quite the contrary, what we are usually interested in is capturing data from different sources (with different levels of information, preferably complementary
) which, unfortunately, yields samples collected in very different ways (Gandomi and Haider, 2015). Consider, for example, a big company. It will probably be interested in its expenses and revenues, costs and benefits, but also in the performance of its latest advertising campaign, in the relationships between its clients, in its public image or in the levels of satisfaction among its clients. Data related to some of these topics will be obtained in a traditional and structured (tabulated) way, but it will also be necessary to analyze comments on social networks, customer calls, survey results, etc., and to include, as far as possible, exogenous variables such as demographic and economic indicators. Hence, this whole process combines structured, non-structured and semi-structured information from which one has to create comparable variables that capture the really useful information. Therefore, it is necessary to combine sources efficiently and 'consistently'. Currently, companies are already doing so, with more or less success, at purely heuristic levels, but we can look for a sense
in these combinations.

Subpopulations-clusters. The basic hypotheses of statistical inference models are independence and identical distribution of the data. Moreover, there exist models for dependent data and
ways to detect subsamples with different distributions. However, what are particular cases in conventional statistics become the general rule in big data problems (Fan et al., 2014). Since reality is complex and big data capture big pieces of this reality, it seems natural to have complex distributions including mixtures of different populations. This leads to new settings, starting with the failure of traditional models. Having several subgroups introduces difficulties in the study and analysis of the data, such as the Yule–Simpson effect. It seems interesting to find the different groups that exist in the sample and characterize them for their study, perhaps applying different models to different groups. Then it is relevant to decide what to do with the minority groups that are no longer so: are they outliers? Should we ignore them? Correct them? Study them separa
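
The Yule–Simpson effect mentioned in the discussion of subpopulations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the paper: the counts are hypothetical, and the names `counts`, `subgroup_rates` and `pooled_rates` are invented here. The sketch compares two options, A and B, within two subgroups and then on the pooled sample.

```python
from fractions import Fraction

# Hypothetical (successes, trials) counts per subgroup and option.
# These numbers are invented for illustration only.
counts = {
    "small": {"A": (18, 20), "B": (64, 80)},
    "large": {"A": (48, 80), "B": (10, 20)},
}


def subgroup_rates(counts):
    """Success rate of each option inside each subgroup."""
    return {
        group: {opt: Fraction(s, t) for opt, (s, t) in options.items()}
        for group, options in counts.items()
    }


def pooled_rates(counts):
    """Success rate of each option after merging all subgroups."""
    totals = {}
    for options in counts.values():
        for opt, (s, t) in options.items():
            prev_s, prev_t = totals.get(opt, (0, 0))
            totals[opt] = (prev_s + s, prev_t + t)
    return {opt: Fraction(s, t) for opt, (s, t) in totals.items()}


by_group = subgroup_rates(counts)
pooled = pooled_rates(counts)

for group, rates in by_group.items():
    print(group, {opt: float(r) for opt, r in rates.items()})
print("pooled", {opt: float(r) for opt, r in pooled.items()})
```

With these hypothetical counts, option A has the higher success rate in both subgroups (0.9 against 0.8, and 0.6 against 0.5), yet option B has the higher rate on the pooled sample (0.74 against 0.66). The reversal arises because the two options are unevenly distributed across subgroups of different difficulty, which is why identifying the groups before modeling can matter.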