
Please cite this article in press as: Torrecilla J.L., Romo J., Data learning from big data. Statistics and Probability Letters (2018), https://doi.org/10.1016/j.spl.2018.02.038.

Statistics and Probability Letters
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/stapro

Data learning from big data

José L. Torrecilla a,*, Juan Romo b
a Institute UC3M-BS of Financial Big Data, Universidad Carlos III de Madrid, Spain
b Department of Statistics, Universidad Carlos III de Madrid, Spain

Article history: Available online xxxx
MSC: 00-01; 99-00
Keywords: Big data; Data learning; Statistics

Abstract. Technology is generating a huge and growing availability of observations of diverse nature. This big data is placing data learning as a central scientific discipline. It includes collection, storage, preprocessing, visualization and, essentially, statistical analysis of enormous batches of data. In this paper, we discuss the role of statistics regarding some of the issues raised by big data in this new paradigm, and we also propose the name "data learning" to describe all the activities that allow us to obtain relevant knowledge from this new source of information.
© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Big data is one of the most fashionable concepts nowadays: everybody talks about it, it is permanently in the media, and companies and governments try to exploit the new wealth of available information (Lohr, 2012; John Walker, 2014; James, 2018). The reasons behind this interest are mainly two. First, the fact that at present most activities generate data (at very low cost) that contain (potentially valuable) information. The second is well summarized in John Walker (2014): "Data-driven decisions are better decisions - it is as simple as that. Using big data enables managers to decide on the basis of evidence rather than intuition". The opportunities offered by big data are undeniable, but there is still a debate about its scope and usefulness (Secchi, 2018; Bühlmann and van de Geer, 2018). The most fervent followers speak of the end of theory and of models; in articles like the controversial "The end of theory" (Anderson, 2008), they argue that "with enough data, the numbers speak for themselves". On the other hand, more critical voices question whether the optimism and faith being placed in big data are really justified. In this line, Tim Harford wonders whether "we are making a mistake" in another provocative article (Harford, 2014). In this paper we review some aspects of big data that can raise doubts from the point of view of a statistician trying to scrutinize whether the data are sufficient by themselves or whether it is necessary to give them a sense.

First, it is convenient to be more specific. Although there is no single definition, there seems to be a certain consensus that big data encompasses the study of problems so "big" that conventional tools and models cannot handle them, either because they are not adequate or because they require too much time. In any case, whatever definition we choose and wherever we put the emphasis, what is clear is that current technology generates huge amounts of data, so we have to be able to extract the best information from them and use it to make the best decisions. How to do so, and the challenges associated with this new framework, have become common discussion topics in recent years (Lynch, 2008; Fan et al., 2014; Gandomi and Haider, 2015), and the best way to tackle the problem has also been a subject of debate. As an example, we can cite the well-known paper by Breiman (2001) on the two cultures of statistical modeling: stochastic models and algorithms (see Dunson, 2018 for a recent discussion in the context of big data). In what follows, we discuss the role of statistics regarding some of the issues raised by big data.

* Corresponding author. E-mail address: joseluis.torrecilla@uc3m.es (J.L. Torrecilla).
https://doi.org/10.1016/j.spl.2018.02.038
0167-7152/© 2018 Elsevier B.V. All rights reserved.

[…]

[…] (2014) uses random forests in a distributed framework for variable selection, an evolutionary algorithm based on MapReduce is proposed in Peralta et al. (2015) for big data classification, and Bolón-Canedo et al. (2015) provide a survey of feature selection algorithms for big data problems ranging from DNA microarray analysis to face recognition. Furthermore, parallel penalized coordinate descent methods such as

those used for lasso optimization and others are studied in Richtárik and Takáč (2016), and fast versions of other traditional algorithms have been proposed, such as a version of PCA based on randomization (Abraham and Inouye, 2014). Finally, reduction methods applied to more complex structures, for example subgraphs, can also be found (Pan et al., 2015).

An important example of big data with high-dimensional observations is functional data. Functional Data Analysis (FDA) deals with objects of infinite dimension, and the problems related to this functional nature are sometimes similar to those associated with big data (see, e.g., Ahmed, 2017; Goia and Vieu, 2016). In particular, a very relevant question in functional data analysis is dimension reduction (see Vieu, 2018).

Despite the number of works on these topics, there is still room for research. Most of these proposals just apply parallelization tools (such as those commented on before), stochastic search, or online learning techniques aimed at accelerating existing algorithms. This computational efficiency is necessary, and it is worth exploring new search strategies, matrix processing techniques, and so on. In fact, the mere parallelization of existing methods is not trivial, since some very basic functionalities are not easily parallelizable (see, e.g., the median, which is not the median of the medians). Beyond that, other aspects can also be considered in order to answer natural questions: What are the relevant variables in this heterogeneous context? How can we find and compare them? How can we deal in a non-trivial way with unstructured information?

These and other problems are largely motivated by the variability and characteristics of the observations in big data problems (not just the dimension). Here, there is no longer a set of homogeneous measures, or even numeric items. The new variables can come from very different contexts, and we do not even know how to code them adequately to extract the useful information. We must explore new ways to develop effective indicators, starting, for example, from opinions on social networks or from telephone calls, to understand problems such as levels of satisfaction or recommendation systems. The complexity of some structures means that in some cases we lack even basic descriptive statistics, and so it is necessary to propose new metrics or ways to establish relationships in order to

be able to measure and compare (Marron and Alonso, 2014). Finally, it will also be important to study the properties of new methods and measures and to develop new visualization tools (Keim et al., 2013).

4. Big samples

We have been dealing with the dimension problem, even with certain levels of heterogeneity, for many years, and we have some experience facing it. Nevertheless, big data has brought a series of new problems, related to the very high number (and often, low quality) of observations, that were not expected a few years ago. On the contrary, from the classical point of view, having a lot of data was always considered something positive that would make the models converge and practically allow us to achieve population results. We have often read and said phrases like "the problem is that we do not have enough data" and explained to our students how having enough data makes everything work. Well, now we have a lot of data and not everything works; in fact, what was promised as a blessing is rather a kind of curse, with unexpected consequences that open new lines of research. Below we briefly discuss some of the problems that have arisen due to the huge size, and sometimes low quality, of the samples.

Heterogeneous sources. In big data applications, it is no longer common for observations

to come from a single source with a single manageable coding. Quite the contrary, what we are usually interested in is capturing data from different sources (with different levels of information, preferably complementary), which, unfortunately, provides samples collected in very different ways (Gandomi and Haider, 2015). Let us think, for example, of a big company. It will probably be interested in its expenses and revenues, costs and benefits, but also in the performance of the last advertising campaign, in the relationships between its clients, and in its public image or levels of satisfaction among its clients. Data related to some of these topics will be obtained in a traditional, structured (tabulated) way, but it will also be necessary to analyze comments on social networks, customer calls, survey results, etc., and to include, as far as possible, exogenous variables such as demographic and economic indicators. Hence, this whole process combines structured, unstructured and semi-structured information, from which one has to create comparable variables that capture the really useful information. Therefore, it is necessary to combine sources efficiently and 'consistently'. Currently, companies are already doing this, with more or less success, at purely heuristic levels, but we can look for a sense in these combinations.

Subpopulations-clusters. The basic hypotheses when working with statistical inference models are independence and identical distribution of the data. Moreover, there exist models for dependent data and ways to detect subsamples with different distributions. However, what in conventional statistics are particular cases become the general rule in big data problems (Fan et al., 2014). Since reality is complex and big data capture big pieces of this reality, it seems natural to have complex distributions, including mixtures of different populations. This leads to new settings, starting with the failure of traditional models. Having several subgroups introduces difficulties in the study and analysis of the data, such as the Yule–Simpson effect. It seems interesting to find the different groups that exist in the sample and characterize them for their study, perhaps applying different models to different groups. Then it is relevant to decide what to do with the minority groups that are no longer so: are they outliers? Should we ignore them? Correct them? Study them separately?
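The Yule–Simpson effect mentioned above is easy to reproduce numerically. The sketch below uses the classic kidney-stone treatment figures often cited to illustrate the reversal (standard textbook values, not data from this paper): one treatment is better within every subgroup, yet looks worse after pooling.

```python
from fractions import Fraction

def rate(successes, total):
    """Exact success proportion as a rational number."""
    return Fraction(successes, total)

# (successes, trials) per stone-size subgroup; classic illustrative figures.
treatment_A = {"small": (81, 87), "large": (192, 263)}
treatment_B = {"small": (234, 270), "large": (55, 80)}

# Within each subgroup, treatment A has the higher success rate...
for group in ("small", "large"):
    assert rate(*treatment_A[group]) > rate(*treatment_B[group])

# ...but pooling the subgroups reverses the ranking (the Yule-Simpson effect):
pooled_A = rate(81 + 192, 87 + 263)   # 273/350, about 0.78
pooled_B = rate(234 + 55, 270 + 80)   # 289/350, about 0.83
print(pooled_A < pooled_B)  # True: B looks better overall
```

The reversal is driven by unbalanced subgroup sizes (most of A's trials fall in the hard "large" group), which is exactly the situation in which fitting one model to the pooled sample misleads.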
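Relatedly, the earlier remark that the median is not the median of the medians can be checked in a few lines. The partition below is hypothetical, standing in for data spread across three machines:

```python
from statistics import median

# Hypothetical data already distributed over three machines; the per-machine
# contents are unbalanced, as is common in practice.
partitions = [[1, 1, 10], [1, 1, 10], [10, 10, 10]]

exact = median(x for part in partitions for x in part)  # global median
naive = median(median(part) for part in partitions)     # median of per-chunk medians

print(exact, naive)  # prints: 10 1 (the naive distributed shortcut is far off)
```

An exact distributed median therefore needs a genuinely different algorithm (e.g., an iterative selection over the partitions) rather than a simple map of the serial one.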
