Statistics Fundamentals and the STATISTICA Software
Engineering Applied Mathematics

Introduction

Science and engineering problems have many aspects. Understanding and solving such problems often involves certain quantitative aspects, in particular the acquisition and analysis of data. Treating these quantitative problems effectively involves the use of statistics. Statistics can be viewed as the prescription for making the quantitative learning process effective.

The Learning Process

An experiment is like a window through which we view nature. Our view is never perfect. The observations that we make are distorted. The imperfections that are included in observations are "noise". A statistically efficient design reveals the magnitude and characteristics of the noise. It increases the size and improves the clarity of the experimental window. Using a poor design is like seeing blurred shadows behind the window curtains or, even worse, like looking out the wrong window.

Learning is an iterative process.

The Aim

Introduction to the general kind of engineering problem and the statistical concepts and methods to be discussed.
Case Study introduces a specific example, including actual data.
Analysis shows how the data suggest and influence the method of analysis and gives the solution.

Many solutions are worked through step by step, with the results shown. The problems were solved using available computer programs (e.g., STATISTICA, SAS, SPSS, S-PLUS, MINITAB).

Definitions and Basic Concepts

Population and Sample
The sample is a group of n observations actually available. A population is a very large set of N observations (or data values) from which the sample of n observations can be imagined to have come.

Random Variable
A random variable is "the value of the next observation in an experiment." "A random variable is the soul of an observation" and, conversely, "An observation is the birth of a random variable."

Experimental Errors
A guiding principle of statistics is that any quantitative result should be reported with an accompanying estimate of its error. Replicated observations of some physical, chemical, or biological characteristic that has the true value η will not be identical, although the analyst has tried to make the experimental conditions as identical as possible.

The relation between the true value η and the observed (measured) value y_i is y_i = η + e_i, where e_i is an error or disturbance.

Error, experimental error, and noise refer to the fluctuation or discrepancy in replicate observations from one experiment to another. In the statistical context, error does not imply fault, mistake, or blunder. It refers to variation that is often unavoidable, resulting from such factors as measurement fluctuations due to instrument condition, sampling imperfections, variations in ambient conditions, skill of personnel, and many other factors. Such variation always exists and, although in certain cases it may have been minimized, it should not be ignored entirely.
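
The error model y_i = η + e_i can be illustrated numerically. The following is a minimal Python sketch, not part of the original slides. It assumes the 27 measurements and the true value η = 8.0 mg/L from the laboratory example that follows, and the helper name `experimental_errors` is hypothetical.

```python
from statistics import mean, stdev

# Measured concentrations (mg/L) from the laboratory example, in order of observation.
OBSERVATIONS = [
    6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8,
    8.0, 10.1, 8.5, 6.5, 9.2, 7.4, 6.3, 5.6, 7.3,
    8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9,
]
TRUE_VALUE = 8.0  # known concentration eta, in mg/L


def experimental_errors(observed, true_value):
    """Return e_i = y_i - eta for each observation (hypothetical helper)."""
    # Rounding to one decimal matches the precision of the reported data.
    return [round(y - true_value, 1) for y in observed]


errors = experimental_errors(OBSERVATIONS, TRUE_VALUE)
print("n =", len(OBSERVATIONS))
print("first seven errors:", errors[:7])
print("sample mean:", round(mean(OBSERVATIONS), 2))
print("sample standard deviation:", round(stdev(OBSERVATIONS), 2))
```

The errors scatter on both sides of zero, which is what the model describes: each observation is the true value plus a disturbance.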

Example

A laboratory's measurement process was assessed by randomly inserting 27 specimens having a known concentration of η = 8.0 mg/L into the normal flow of work over a period of 2 weeks. This arrangement means that observed values are random and independent. The results in order of observation were 6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5, 6.5, 9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, and 7.9 mg/L.

The population is all specimens having a known concentration of 8.0 mg/L. The sample is the 27 observations (measurements). The sample size is n = 27. The random variable is the measured concentration in each specimen having a known concentration of 8.0 mg/L.

Experimental error has caused the observed values to vary about the true value of 8.0 mg/L. The errors are 6.9 - 8.0 = -1.1, 7.8 - 8.0 = -0.2, +0.9, -2.8, -0.3, +1.6, +0.7, and so on.

Plotting Data

The most effective statistical techniques for analyzing data are graphical methods. They are useful in the initial stage for checking the quality of the data, highlighting interesting features
of the data, and generally suggesting what statistical analyses should be done. Graphical methods are useful again after intermediate quantitative analyses have been completed, and again in the final stage for providing complete and readily understood summaries of the main findings of investigations.

The first step in data analysis should be to plot the data. Graphing data should be an interactive experimental process. Do not expect your first graph to reveal all interesting aspects of the data. Make a variety of graphs to view the data in different ways.

Plotting the data may:
1. reveal the answer so clearly that little more analysis is needed.
2. point out properties of the data that would invalidate a particular statistical analysis.
3. reveal that the sample contains unusual observations.
4. save time in subsequent analyses.
5. suggest an answer that you had not expected.
6. keep you from doing something foolish.

The time spent making some different plots almost always rewards the effort. Many top-notch statisticians like to plot data by hand, believing that the physical work of the hand stimulates the mind's eye. Whether you plot by hand or use one of the many available computer programs (Origin PRO, SigmaPlot, Grapher, etc.), the goal is to free your imagination by trying a variety of graphical forms. Keep in mind that some computer programs offer a restricted set of plots and thus could limit rather than expand the imagination.

Scatterplots and Statistical Plots

Scatterplots
It has been estimated that 75% of the graphs used in science are scatterplots. Simple scatterplots are often made before any other data analysis is considered. The insights gained may lead to more elegant and informative graphs, or suggest a promising model. Linear or nonlinear relations are easily seen.

Showing Statistical Variation and Precision
Measurements vary, and one important function of graphs is to show the variation. There are three very different ways of showing variation: a histogram, a box plot (or box-and-whisker plot), and error bars that represent statistics such as standard deviations, standard errors,
or confidence intervals. A histogram shows the shape of the frequency distribution and the range of values.

Plots of Residuals

Graphing residuals is an important method that has applications in all areas of data analysis and model building. Residuals are the differences between the observed values and the smooth curve constructed from a model of the data. If the model fits the data, the residuals represent the measurement error. Measurement error is usually assumed to be random. A lack of randomness in the residuals therefore indicates some weakness in the fitted model.

Plots of Residuals: Example

The visual impression in the top panel of the figure is that the curve fits the data fairly well, but that the vertical deviations of points from the fitted curve are smaller for low values of time than for longer times. The graph of residuals in the bottom panel shows the opposite is true. The curve does not fit well at the shorter times, and in this region the residuals are large and predominantly positive.

This process of plotting residuals is called flattening the data. It shifts our attention from the fitted line to the discrepancies between prediction and observation. It is these discrepancies that contain the information needed to improve the model. Make it a habit to examine the residuals of a fitted model, including deviations from a simple mean. Check for normality by making a dot diagram or histogram. Plot the residuals against the predicted values, against the predictor variables, and as a function of the time order in which the measurements were made. Residuals that appear to be random and to have uniform variance are persuasive evidence that the model has no serious deficiencies. If the residuals show a trend, it is evidence that the model is inadequate. If the residuals spread out, it suggests that a data transformation is probably needed.

Plots of Residuals: Another Example

The left figure is a calibration curve for measuring chloride using an ion chromatograph. There are three replicate measures at each concentration level. The hidden variation of the replicates is revealed in the right figure, which has flattened
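the data by looking at deviations from the average of the three values at each level. An important fact is revealed: the measurement error tends to increase as the concentration increases. This must be taken into account when fitting the calibration curve to the data.

A Note on Clarity and Style of Plots

Tufte: Clarity, Simplicity
Cleveland: Clarity, Precision, Efficiency
Wainer: Elegance, Grace, Impact

William Playfair (1786), a pioneer and innovator in the use of statistical graphics, desired to tell a story graphically as well as dramatically.

Should We Always Plot the Data?

Example. Five values: pH = 5, COD = 2300 mg/L, BOD = 1500 mg/L, TSS = 875 mg/L, TDS = 5700 mg/L.

These five values say it all, and better than the graph. Do not use an axe to hack your way through an open door. Aside from being unnecessary, this chart has three major faults:
1. It confuses units: pH is not measured in mg/L.
2. Three-dimensional effects make it more difficult to read the numerical values.
3. Using a log scale makes the values seem nearly the same when they are much different. The 875 mg/L TSS and the 1500 mg/L BOD have bars that are nearly the same height.

Statistical Analysis Functions of STATISTICA

1. Basic Statistics and Tables
Includes descriptive statistics, correlation analysis, t-tests for independent and dependent samples, frequency tables, probability calculations, and other tests of the significance of differences (tests of two means or two percentages). These are the most basic and the most frequently used analyses; most simple statistical problems can be handled with this module alone.

2. Multiple Regression
Stepwise regression, fixed nonlinear regression, residual analysis, and prediction based on the regression model. For example, regression can be used to investigate whether a person's IQ is related to eating fish and tofu.

3. ANOVA/MANOVA (analysis of variance)
One-way and multi-factor analysis of variance, analysis of covariance, and repeated-measures analysis of variance. Analysis of variance is used to test the significance of differences among the means of more than two samples, for example to compare which of several teaching methods raises test scores fastest, or to compare the mileage obtained with several brands of gasoline.

4. Nonparametrics/Distribution
Includes the chi-square test, the Kolmogorov-Smirnov test, the Wilcoxon matched-pairs signed-rank test, the Mann-Whitney test for two independent samples, the Cochran Q test for several related samples, and the Kruskal-Wallis test for several independent samples.

5. Distribution Fitting
Fits continuous distributions such as the normal and uniform distributions.

6. Advanced Linear/Nonlinear Models
Contains a range of linear and nonlinear modeling functions, such as Nonlinear Estimation, which includes general nonlinear models, stepwise logit analysis, and maximum likelihood estimation.

7. Industrial Statistics & Six-Sigma
Includes quality control, process analysis, experimental design, and Six Sigma analysis.

8. Multivariate Exploratory Analysis
(1) Cluster Analysis: includes K-Means clustering, two-way joining, and others. Cluster analysis looks for a statistic that objectively reflects how close or distant the elements are from one another, and then uses it to divide the elements into classes. It is the statistical version of "birds of a feather flock together."
(2) Factor Analysis: initial factor models, rotated factor models, and others. For example, a student's grades in different subjects are influenced by factors such as intelligence, computational ability, verbal ability, and flexibility. The grades can be obtained through examinations, but the factors that govern them cannot be measured directly. This is where factor analysis is useful.
(3) Canonical Analysis: canonical correlation analysis and analysis of canonical factor co-effects. It is mainly used to study the correlation between two sets of variables.
(4) Multidimensional Scaling: estimation of multidimensional distances or similarities.
(5) Reliability/Item Analysis: includes tetrachoric correlation, Cronbach's α coefficient, and split-half reliability. If a means of transport is expected to be dependable at any time, in any place, and for anyone, its reliability clearly needs to be tested.
(6) Discriminant Analysis: stepwise discriminant analysis, classification statistics, and others. The task of discriminant analysis is to build, from a set of samples whose classes are known, a discriminant function that produces as few misclassifications as possible, and then to decide which population a new sample comes from. In environmental monitoring, for example, the combined measurements of pollution in a region can be used to decide which pollution type the region belongs to.

9. Data Mining: classification trees and related techniques, neural networks.

10. Probability Calculator.

[Figure: the STATISTICA graphical interface]

Basic Operating Procedure in STATISTICA

(1) Enter the data, mainly through the Spreadsheet window (the data editing window). Its structure resembles an Excel worksheet. The default data sheet is a 10 × 10 grid of cells, and the number of variables (Variable) or observations (Case) can be changed. Note that empty cells are still counted, using a default value, so unneeded cases should be deleted. Variables and cases are deleted with the Delete command on the Edit menu, and added with the Variables and Cases commands on the Format menu.

STATISTICA can open files produced by Excel, dBASE, SPSS, and Lotus/Quattro worksheets, as well as text formats with the extensions txt, csv, htm, and rtf, and saves them in the STATISTICA data file format.

(2) Select the functional module, mainly through the commands on the Statistics menu.

(3) Define the analysis method and select the independent and dependent variables for the analysis.

(4) Display the results. By default STATISTICA sends analysis results, both tables and graphs, to a Workbook window. The other output mode is the Report.

Report options: the "Output Manager ..." command on the "File" menu.

[Figure: the Report window]

Application Example 1: Descriptive Statistics

1. Descriptive statistics
Descriptive statistics are the foundation of statistics. Their task is to provide basic information about each variable:
mean
minimum and maximum values
variation of measures, that is, the shape of the distribution
standard deviation
standard error

(1) Definition of the mean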

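The mean is the most commonly used descriptive statistic. It conveys the "central tendency" of a variable, and it is most informative when reported together with its confidence interval. The confidence interval is a range around the sample mean within which the "true" population mean is expected to lie at a level of confidence we are willing to accept.

For example, if the mean is 23 and the lower and upper limits of the confidence interval at p = 0.05 are 19 and 27, then there is a 95% probability that the population mean is greater than 19 and less than 27. If the p-level is set to a smaller value (that is, the confidence level is raised), the confidence interval becomes wider, which increases the certainty that the interval contains the population mean.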
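
The relation between the confidence level and the interval width can be sketched in a few lines of Python. This is an illustrative sketch only: the sample is invented, the helper name `mean_confidence_interval` is hypothetical, and the interval uses a normal approximation from the standard library rather than the Student's t interval that is usually preferred for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the mean (illustrative helper)."""
    n = len(data)
    centre = mean(data)
    standard_error = stdev(data) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


# Hypothetical sample, invented for illustration only.
sample = [21.0, 25.0, 19.0, 27.0, 23.0, 22.0, 24.0, 20.0, 26.0, 23.0]

low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
print(f"mean = {mean(sample):.1f}")
print(f"95% interval: {low95:.2f} to {high95:.2f}")
print(f"99% interval: {low99:.2f} to {high99:.2f}")
```

Moving from p = 0.05 to p = 0.01 widens the interval, and the standard error term shows why a larger sample or less variable data narrows it.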

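As with a familiar weather forecast, the smaller the p-level and the wider the confidence interval, the vaguer the forecast. Note that the confidence interval depends on the sample size and on the variation of the data values. The larger the sample, the more reliable the mean; the greater the variation in the data, the less reliable the mean. In addition, the calculation of the confidence interval assumes that the variable is random and normally distributed in the population. If this assumption is not met, the estimate may not be valid unless the sample is large.

(2) Shape of the distribution

Another important aspect of describing a variable is the shape of its distribution, which expresses how frequently the values of the variable fall in different ranges. A histogram is used to depict these frequencies. Researchers are usually interested in how well the histogram compares with a normal distribution.

A histogram can be used to check the quality of the distribution. For example, if the distribution is bimodal (has two peaks), the sample may not be homogeneous: it may come from two different populations, one closer to a normal distribution and the other less so. In that case the two subsamples should be analyzed separately.

2. Correlation

Correlation is a measure of the relationship between two or more variables, and different correlation coefficients handle different types of data. The most important type is linear correlation, also called Pearson correlation (Pearson r), which is the most widely used correlation coefficient.

Pearson correlation assumes that the two variables are measured on at least interval scales. It expresses the extent to which the values of the two variables are proportional to each other, and that extent is the correlation coefficient (r). Correlation coefficients range from -1.00 to +1.00. A value of -1.00 represents a perfect negative correlation, +1.00 a perfect positive correlation, and 0.00 no correlation.

The Pearson correlation coefficient (r) does not depend on the units of the variables. For example, height and weight can be correlated whether they are measured in inches and pounds or in centimeters and kilograms.

Proportional means linearly related: the relationship can be represented by a straight line sloping upward or downward. This line is called the regression line or least squares line, because the sum of the squared distances of all the points from the line is the smallest possible. In particular, the squared correlation coefficient (r^2) expresses the proportion of variation that the two variables share.

3. t-test for Independent Samples

The t-test is the most commonly used method for evaluating the difference between two samples. For example, it can be used to test the difference in outcomes between two groups of patients who received different treatments. In theory the t-test can be used even with very small samples, as long as each group is normally distributed. The normality assumption can be judged from the distribution shown in a histogram, or with a normality test. The p-level reported with a t-test is the probability of error involved in rejecting the null hypothesis (that the two groups do not differ).

To run an independent-samples t-test, one independent (grouping) variable is needed (such as "GENDER" in the table below) and at least one dependent variable (such as the test score "WCC"). The mean of the dependent variable is computed separately for each group defined by the independent variable (such as "male" and "female"), and the group means are compared. If there are several dependent variables, a t-test is run for each of them.

Example 1: Descriptive Statistical Analysis

Problem description: the method is illustrated with the "Adstudy.sta" data file supplied with STATISTICA, which contains 25 variables and 50 cases. The hypothetical study examines how men and women rate two advertisements, and it is assumed that the responses to each advertisement are random. Variable 1 is gender (Gender: male, female) and variable 2 is the advertisement (Advert: Coke, Pepsi). The respondents rated the advertisements on 23 different scales (Measure01 to Measure23), giving an answer from 0 to 9 on each.

Descriptive statistics

Step 1: Start STATISTICA and open the data file "Adstudy.sta" in the "/Examples/Datasets" directory. The data file can also be opened from within a statistics module.

Step 2: Descriptive Statistics. In the "Basic Statistics and Tables" dialog, select "Descriptive statistics" and run the descriptive analysis on all variables.

Results of the descriptive analysis: by default, the results table contains the mean, the number of valid cases (valid N), the standard deviation, and the minimum and maximum of the selected variables.

Correlation analysis

Step 3: Correlation analysis. Click "Cancel" to return to the "Basic Statistics and Tables" dialog, select "Correlation matrices", and click "OK". Alternatively, double-click the "Correlation matrices" option.

The "Product-Moment and Partial Correlations" dialog appears. Click the "One variable list" button. In the variable selection window you can select one, several, or all variables. Here, click "Select all" to select all of them. Then click the "Summary" button to run the analysis and display the table of correlation results.

The correlation results table

Highlighting significant correlations: by default, the table shows in a different color the correlation coefficients that are significant at p < .05. The user can set the level at which coefficients are highlighted. The larger the absolute value of a correlation coefficient, the stronger the correlation between the variables. A positive coefficient indicates a positive correlation; otherwise the correlation is negative.

To set the significance level, return to the "Product-Moment and Partial Correlations" dialog, click the "Options" tab, and change the p-level, for example to 0.001. Click the "Summary" button to produce a new results table in which the coefficients that meet this significance level are highlighted, so that the strongest correlations are easy to find.

Correlation analysis results

In this example, Measure05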
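
The statement in the correlation section that Pearson r does not depend on the units of measurement can be checked numerically. The sketch below is illustrative only: the height and weight values are invented, and `pearson_r` is a hypothetical helper that implements the standard product-moment formula, not a STATISTICA function.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical heights (inches) and weights (pounds), invented for illustration.
height_in = [62.0, 65.0, 68.0, 70.0, 73.0]
weight_lb = [120.0, 140.0, 155.0, 165.0, 190.0]

# The same measurements converted to centimeters and kilograms.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)
print(f"r (inches, pounds)       = {r_imperial:.4f}")
print(f"r (centimeters, kilograms) = {r_metric:.4f}")
print(f"r squared                = {r_imperial ** 2:.4f}")
```

Rescaling either variable by a positive constant multiplies the numerator and the denominator of r by the same factor, so the coefficient is unchanged.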
