November 6, 2009

Epi - Statistical tools for Epidemiology

Epi has been around for a long time, starting back in the days of DOS! Over the years Epi has matured into a suite of tools (Pepi) that has the fields of public health and epidemiology as its focus. WINPEPI software for the Windows platform is free and offers a broad mix of software that general statistics users might consider as an alternative to more costly options. Epi is like the Energizer Bunny: it is the gift that keeps going and going....

October 30, 2009

MCA - a thing of the past?

All pair-wise Multiple Comparisons (MCA) is a well-known collection of procedures for the stochastic ordering of means, which is a common research task. Classical methods rely on the assumption that the null hypothesis is true. Modern alternatives can be found in the Bayesian Statistics paradigm, which abandons the Type 1 error notion. In particular, for problems that can be cast in the hierarchical modeling framework, a principled Bayesian approach relies on partial pooling and shrinkage. Technical arguments supporting this approach have been around for some time. An excellent working paper by Andrew Gelman on the topic presents an overview, simulation results and examples demonstrating the benefits in an applied setting. Suggestions on the use of R and other software for implementation are mentioned.
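
To make the idea of partial pooling and shrinkage concrete, here is a minimal Python sketch of the simplest normal-normal hierarchical model. It is not taken from the working paper: the function name partial_pool is my own, and both the within-group variance (sigma2) and the between-group variance (tau2) are assumed known, which a full Bayesian analysis would not assume.

```python
def partial_pool(group_means, group_ns, sigma2, tau2):
    """Shrink raw group means toward a common mean (normal-normal model).

    Assumptions (illustrative only): sigma2 is a known within-group
    variance and tau2 is a known between-group variance.
    """
    # Marginal variance of each observed group mean is sigma2/n + tau2,
    # so the common mean is estimated by a precision-weighted average.
    weights = [1.0 / (sigma2 / n + tau2) for n in group_ns]
    grand = sum(w * m for w, m in zip(weights, group_means)) / sum(weights)

    pooled = []
    for m, n in zip(group_means, group_ns):
        # b is the share of weight placed on the common mean; small or
        # noisy groups (large sigma2/n) are shrunk the most.
        b = (sigma2 / n) / (sigma2 / n + tau2)
        pooled.append(b * grand + (1.0 - b) * m)
    return grand, pooled
```

With three groups of five observations each, raw means 10, 12 and 20, sigma2 = 25 and tau2 = 5, every mean is pulled halfway toward the common mean of 14. The extreme group moves the furthest in absolute terms, which is what tempers the pairwise comparisons.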

October 27, 2009

An improved Spatial Scan Statistic

Spatial scan statistics have been an important class of tools for cluster detection in spatial data. These are often used in support of surveillance and detection activities in public health and other fields. A common limitation of popular spatial scan statistics is that they do not accommodate uncertainty in the measure of interest. In a recent JASA (Sept. 2009) article, Weighted Normal Spatial Scan Statistic for Heterogeneous Population Data, the authors offer a solution that addresses this problem in more generality. Weights related to local variance measures or proxies such as sample size can be created for use in a weighted likelihood approach. Extensions to non-Gaussian probability models are addressed. The case studies and power simulations provided suggest excellent performance. Their solution has been implemented in the freely available software SaTScan.

October 6, 2009

mixAK: New data clustering options

Cluster Analysis (and other tools) is often deployed to investigate structure (clustering) in multidimensional data sets. One approach to modeling such data is the Gaussian mixture model. mixAK is a new R package for Bayesian estimation of multivariate normal mixtures that allows for selection of the number of mixture components and density estimation, and optionally handles interval-censored multivariate data. Author Arnost Komarek's article in Computational Statistics and Data Analysis, Volume 53, Issue 12, October 2009, presents the underlying theory and application of the new approach using RJ-MCMC estimation. The selection of the number of mixture components is aided by Deviance Information Criterion (DIC) and Penalized Expected Deviance (PED) measures.

September 17, 2009

Survey weights and new ANES suggestions

Many large surveys are structured as complex sample designs that reflect various stratification considerations. Statistics calculated from such designs must be weighted to reflect the general population of interest. A clear discussion and set of recommendations by four prominent researchers for the calculation and implementation of weights using ANES datasets can be found in the Sept. 2009 Technical Report, nes012427, Computing Weights for American National Election Study Survey Data. The report can be found in the Reference Library section of the ANES website. Single panel cross-sectional, two-wave panel and multi-wave panel recommendations are considered along with nonresponse and poststratification weighting. The generality of discussion applies to other large studies such as Census data, and similar surveys.
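
As a small illustration of the poststratification step mentioned above, the Python sketch below computes cell weights as population share divided by sample share and then forms a weighted mean. It is a generic textbook calculation, not the procedure recommended in the ANES report; the function names and the cell labels in the example are hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each cell = population share / sample share.

    sample_counts: dict mapping cell label -> number of respondents
    population_shares: dict mapping cell label -> known population proportion
    """
    n = sum(sample_counts.values())
    return {
        cell: population_shares[cell] / (count / n)
        for cell, count in sample_counts.items()
    }


def weighted_mean(values, cells, weights):
    """Weighted mean of respondent values, using each respondent's cell weight."""
    numerator = sum(weights[c] * v for v, c in zip(values, cells))
    denominator = sum(weights[c] for c in cells)
    return numerator / denominator
```

For example, if a sample holds 20 "young" and 80 "old" respondents while the population is split evenly, the weights come out as 2.5 and 0.625, so the under-represented cell counts for more in any weighted statistic.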

September 9, 2009

Areal and point source spatial data models

Researchers using spatial data are often faced with a mix of data obtained from several levels of scale, aggregation and point reference data. Classical geospatial regressions do not deal with this mix very well, and standard ordinary regressions do even worse. A unified treatment is the topic of a recent article, "Reparameterized and Marginalized Posterior and Predictive Sampling for Complex Bayesian Geostatistical Models" in Volume 18, Number 2 of JCGS. In short, the authors cleverly reparameterize and recast the problem so as to allow efficient MCMC samplers to address the Bayesian estimation task. Their article's supplemental materials provide the R and OpenBUGS code for the efficient estimation tasks outlined.

September 8, 2009

SPSS resources

SPSS software has an extensive tutorial built into its product, and most first-time users will benefit from using it. Additional SPSS resources can be found here.

September 3, 2009

R available on Tufts Linux Cluster

Elsewhere on this Blog I mention various bits and pieces of R software. Now that the fall semester is upon us, we have added many new R bioinformatics packages to the baseline R installation on our research Linux cluster. This option provides a scalable solution for those needing additional computing power.

July 14, 2009

Bayes Software...the next big effort

Historically, Bayesian solutions were computed as needed in formal languages (Fortran, C, Java, etc.) and later in high level solutions like Matlab, Gauss, SAS/IML and others. Then WinBUGS came along and offered a higher level interface, similar to what Matlab did for linear algebra syntax and functionality, but closer in spirit to the notation used by statisticians to depict multilevel probability based models. While all of these still have their pros and cons, we now find an explosion of Bayesian solutions implemented in R with the benefit of object orientation. If one takes a look at the "CRAN Task View: Bayesian Inference" page on the R site maintained by Jong Hee Park, one will find 60+ packages with numerous solutions to many standard statistical modeling problems. Of the many listed, note the package BAS for Bayesian Model Averaging in linear models using stochastic or deterministic sampling without replacement from posterior distributions. Prior distributions on coefficients are from Zellner's g-prior or mixtures of g-priors corresponding to the Zellner-Siow Cauchy priors or the Liang et al. hyper-g priors. The stochastic search capability allows for model specification searches that, a few years ago, would not have been possible with the ease that is possible now.

May 21, 2009

Inference for R and MS Excel

The widespread availability of Microsoft Excel has created a less than desirable environment for statistical computing. In my opinion the Excel statistics add-in leaves much to be desired relative to real statistics packages. One solution for extending the usefulness of Excel is to abandon the Excel stats package in favor of InferenceforR. This product allows for the use of R within Excel. See the following screencast for a slick presentation.

May 6, 2009

Numerical routines for Java development

If you want to save time and improve the accuracy of your programs, don't reinvent the wheel; consider using javanumerics. A large variety of statistical and mathematical classes are available. Note, not all options are free.

April 29, 2009

Specialized Statistical Resources

Often researchers will need access to functionality that isn't found in commercial statistics packages. This problem varies quite a bit and is met with specialized solutions by the statistical community. These solutions are often cutting edge, reflecting new statistical research. Most stats packages allow some form of macro/code authorship. This works to a point and often provides a just-in-time solution. Well known examples include Matlab's scripting language, SAS IML, GAUSS, Stata, Splus and R. Yet others will seek stand-alone solutions in one form or another. These range from public domain C, C++, Fortran, and Java research subroutines to stand-alone programs with various user interfaces. The goal of this blog is to list references and short descriptions of various solutions that may offer additional insights into your research and the statistical methods, and maybe even save you some time. About a dozen or so topics come to mind and I hope to address them shortly. These posts are not intended as statistical guidance or endorsement. Most problems are best addressed by the advice of an experienced practitioner in the relevant field.

April 13, 2009

Statistical Power calculations

Statistical power calculations are often needed at various stages of planning for establishing sample sizes. Elsewhere on this Blog I mention PiFace as a power calculation tool. However, SAS users may find the following three SAS macros of interest. UnifyPow is an extensive collection of power calculators implemented in SAS as a Macro. A SAS proceedings paper about UnifyPow discusses its broad generality. The second macro is rpower, which addresses the retrospective aspect of the issue. The third macro, glimmixsamplesize, is designed to use the generality of SAS's Proc Glimmix for generalized linear mixed models. These macros provide a substantial increase in the number of settings that can be addressed for power calculations.
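
For readers who want to see what the simplest such calculation looks like, here is a small Python sketch of the normal-approximation sample size for comparing two means with equal group sizes and a common known standard deviation. It is unrelated to the SAS macros above; the function names are mine, and a t-based calculation would give a slightly larger answer.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided, two-sample z-test.

    delta: difference in means to detect; sigma: common standard deviation.
    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def power_two_sample(n, delta, sigma, alpha=0.05):
    """Approximate power for n per group (ignores the negligible far tail)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    se = sigma * math.sqrt(2 / n)
    return z.cdf(abs(delta) / se - z_alpha)
```

Detecting a half standard deviation difference (delta = 0.5, sigma = 1) at the usual 5% level with 80% power gives n_per_group(0.5, 1.0) equal to 63 per group under this approximation.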

April 9, 2009

R graphics gallery

Sometimes it is important to reinvent the wheel, and sometimes not. Here is a site with a nice collection of contributed R graphic examples from a variety of R packages. Almost all are supportive of some statistical method for purposes of summary and presentation.

March 25, 2009

Discrete Choice Models

This is a broad and rich topic. Applications are found in almost every field. Over the past 30+ years major theoretical contributions from Econometrics, Psychometrics and Statistics have established the topic as a vibrant research area. Most major statistics oriented software packages provide most of the basic functionality. But sometimes this doesn't go far enough. Sometimes real world models are defined with just enough complication that one can't cast the model(s) of interest within the user interface provided by most software. Of course the solution is to step out from those software constraints and code the solution that is needed. Elsewhere in this blog there are software options that may be of use, and sometimes one needs access to codes at a more fundamental level. Another issue is that many researchers are not as familiar with the topic as they wish to be, but would otherwise like to know more. University of Calif. Economics Prof. Kenneth Train has provided both Gauss and Matlab codes addressing many Discrete Choice Models. In addition his site has about 20+ hours of lectures available for streaming download.
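
For those newer to the topic, the core of the most basic discrete choice model, the multinomial logit, fits in a few lines. The Python sketch below is only an illustration of how choice probabilities follow from linear utilities; it is not drawn from Prof. Train's Gauss or Matlab codes, and the function names are my own.

```python
import math


def utility(beta, x):
    """Linear utility: inner product of coefficients and alternative attributes."""
    return sum(b * xi for b, xi in zip(beta, x))


def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities: exp(V_j) / sum_k exp(V_k)."""
    # Subtracting the maximum utility avoids overflow and leaves the
    # probabilities unchanged.
    m = max(utilities)
    exp_u = [math.exp(u - m) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]
```

Given a coefficient vector and one attribute vector per alternative, computing utility for each alternative and passing the list to mnl_probabilities returns one probability per alternative. The more complicated models mentioned above (mixed logit, probit and so on) replace this closed form with simulation.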

March 3, 2009

Clustering Software reviews

Cluster analysis has been a data mining tool for some time. There are hundreds of cluster algorithms that compete on various statistical notions of performance. All major statistical software packages offer several solutions to address this task. Recently, Latent Class Analysis has found its counterpart in the cluster analysis problem setting, where the unknown number of classes or groups is treated in either a stochastic or deterministic manner. In The American Statistician, Feb 2009, Vol. 63 article, Review of Three Latent Class Cluster Analysis Packages: Latent Gold, poLCA, and MCLUST, one finds yet another discussion and comparison of the ever expanding software choices. The point of this note is that the solution offered by the MCLUST program is available as free R software of the highest quality and performance. MCLUST performs model based clustering with multivariate normal mixtures. A Bayesian treatment of the latent class problem by MCLUST treats the unknown number of classes/groups as a random variable, and the marginal posterior distribution of the number of classes is an outcome!

February 24, 2009

Bayesian Spatially Varying Regression Coefficients

Spatial smoothing techniques are often employed to estimate mean trends over some spatial and/or time domain. An explosion of new estimation methods in the last 15 or so years has improved upon the simple multiple regression and Kriging options often found in commercial GIS systems. Spatial regression models for a general linear model setting with different possible link functions and using CAR (conditional autoregressive) or SAR (simultaneous autoregressive) error structures were among the early additions. Extensions to hierarchical models allow for additional model complexity at the cost of increased computational burden. The following Article by authors Wheeler and Walker illustrates how the use of Bayesian Spatially Varying Regression Coefficient models improved upon older methods such as Kriging in estimating the effects of barriers to the transmission of rabies. The estimation of their models was carried out in WinBUGS software via MCMC sampling for Bayesian Spatially Varying Regression Coefficient models using MCAR (multivariate conditional autoregressive) errors. To see the inference impact of such models on a per covariate (spatial) basis, one set of maps in Figure 4 illustrates very nicely what is missing in simpler maps and models. This Article presents some statistical background.

February 23, 2009

Small Sample Survey Sampling Confidence Intervals

A recent article in The International Journal of Biostatistics argues for confidence intervals for the mean derived from the use of Bernstein's inequality. An excellent presentation compares and contrasts the proposed method with standard alternatives. In keeping with the spirit of this blog, one may find R software code proposed by the authors to compute the new method.
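
To give a feel for how an interval can be built from Bernstein's inequality, here is a generic Python sketch. It is not the authors' R code and may differ from their method in the details. It assumes that one can supply a known upper bound on the variance (var_bound) and a known bound on how far any observation can fall from the mean (range_bound); the function names are hypothetical.

```python
import math


def bernstein_halfwidth(n, var_bound, range_bound, alpha=0.05):
    """Half-width t such that Bernstein's bound on P(|mean - mu| >= t) equals alpha.

    Bernstein: P(|mean - mu| >= t) <= 2 * exp(-n t^2 / (2 * (var + M t / 3))),
    with var <= var_bound and |X - mu| <= M = range_bound (both assumed known).
    Setting the bound equal to alpha gives a quadratic in t.
    """
    log_term = math.log(2 / alpha)
    b = 2 * log_term * range_bound / 3
    # Positive root of n t^2 - b t - 2 * log_term * var_bound = 0.
    return (b + math.sqrt(b * b + 8 * n * log_term * var_bound)) / (2 * n)


def bernstein_interval(data, var_bound, range_bound, alpha=0.05):
    """Conservative confidence interval for the mean of bounded observations."""
    n = len(data)
    center = sum(data) / n
    t = bernstein_halfwidth(n, var_bound, range_bound, alpha)
    return center - t, center + t
```

For 0/1 survey responses one can take var_bound = 0.25 and range_bound = 1. The resulting interval is wider than the usual normal-theory interval, but its coverage guarantee does not lean on a large-sample approximation.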

January 7, 2009

Cross-Over Designs

Cross-Over Designs have played a major role in applied settings that cut across many disciplines, and there is a rich history and literature on the topic. I recently acquired: Design and Analysis of Cross-Over Trials, Second Edition, by Byron Jones and Michael G. Kenward, Chapman & Hall/CRC PRESS. In keeping with the spirit of this blog, I would like to mention that the authors provide the SAS code that accompanies their excellent text. As one can see, many solutions are cast in the Mixed Model framework, offering covariance structures and estimation options that allow the most flexibility for modeling simple to advanced Cross-Over Designs. The codes are available here.

December 10, 2008

Advanced Log-Linear Models

An excellent book, Models for Discrete Data, Daniel Zelterman, Oxford Science Publications, ISBN 0-19-852436-6, addresses theory and applications. In keeping with this blog's software focus, the point of mentioning his book is the author's contributed SAS examples. You will find standard and extended applications of SAS's Proc Genmod in a variety of log-linear model settings using Binomial, Poisson, Hypergeometric and Negative Binomial response models. Another excellent text is Alan Agresti's Categorical Data Analysis, Wiley, ISBN 0-471-36093-7. His contributed SAS routines are quite extensive as well. A careful study of these books and examples will help most any applied researcher with the discrete random variable setting.

December 1, 2008

Common Principal Components

Covariance matrices are a natural multivariate structure as input for various methods such as Structural Equations Modeling (SEM), Factor Analysis, Principal Components and Cluster Analysis, to name a few. Almost all can be shown to be special cases of SEMs. Recently I came across the need to consider the question of Common Principal Components (CPC). The situation is a k-group setting involving questions about shared principal components among the k groups. Note, this notion is not restricted to equality of covariance matrices. However, this particular notion (CPC) is not directly estimated and tested within common statistical software such as SAS's Proc Calis or the well known Lisrel. It turns out that this was the focus of statistician Bernhard Flury's (1988) research. CPC software and references to Flury's work can be found here.

Multivariate Nonparametric Methods

Univariate Nonparametric tests have been available for some time. Exact Inference for these tests is also an option from software such as StatXact. However, multivariate versions of univariate nonparametric tests are harder to come by in most standard statistical software. Oja and Randles's 2004 paper, Multivariate Nonparametric Tests, Statistical Science, Vol. 19, No. 4, p. 598, addresses the problem and offers an implementation as R routines. Author Oja's website describes access to the following tests: Multivariate sign test; Multivariate sign-rank test; Sign test of independence; Spearman's rho-type test of independence; Kendall's tau-type test of independence; Several-samples MANOVA multivariate sign test and a Several-samples MANOVA multivariate rank test.

November 18, 2008

VUE

I want to put a plug in for VUE. VUE is not a statistical package, but it can be useful in one's modelling efforts for mapping the relationships between variables, for presentation needs, for its extensive annotation capabilities, and for sharing or just brainstorming. How about mapping relationships on competing theories and other complex structures? The Visual Understanding Environment (VUE) is an Open Source project based at Tufts University. The VUE project is focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE provides a flexible visual environment for structuring, presenting, and sharing digital information.

Journal of Statistical Software

This site is very useful for both research and teaching for all things statistical. What a nice resource. Enjoy.

September 14, 2007

LLMM - local linear mixed models

The generality of Mixed Models (Linear or NonLinear) is well known and has found extensive use in just about every applied research setting in the Social Sciences, Medicine, Engineering and GIS applications. Variations of these methods have been implemented in most major statistical packages. A recent article in the Journal of Agricultural, Biological, and Environmental Statistics (Sept. 2007) by Heegaard and Nilsen discusses theory and application in a biological spatial setting. LLMM models can be thought of as complementary to GAMM models but with additional user control of the fixed and stochastic structures for purposes of spatial smoothing. An implementation of LLMM as a contributed R package can be found at: http://eecrg.uib.no/personal_pages/LLMM.htm

June 21, 2007

The Bias Project

The Bias Project is an effort to produce research and software addressing Bayesian methods for combining individual and aggregate data sources as seen in Ecological Inference and Small Area Estimation problems. Many of the software contributions are available as either WinBUGS or R programs. The Bias Project is one of several research efforts at ESRC where unique problems in Social Science research are addressed and explored with modern statistical methods.

June 19, 2007

MIXPREG - mixed effects models

MIXPREG, MIXOR, MIXREG, MIXGSUR and MIXNO are a collection of high quality generalized regression modeling tools that allow for mixed effects and various forms of censoring. Correlated, nested and hierarchical data structures are also addressed. Estimation and inference are based on a full-information maximum likelihood approach rather than a Taylor expansion to linearize the likelihood. The software is free, and Windows XP, Solaris and Mac versions are available.

May 29, 2007

Solar

SOLAR stands for Sequential Oligogenic Linkage Analysis Routines. SOLAR addresses genetic variance components analysis, including linkage analysis, quantitative genetic analysis, and covariate screening. Two basic types of linkage analysis are available, Twopoint and Multipoint. Maximum Likelihood estimation, Monte Carlo Simulations and Bayesian Model Averaging are some of the options available to address model formation and screening. SOLAR is available on Tufts Bioinformatic Server where larger computational intensive jobs may be run.

CsPro - Census.Gov survey program

CSPro (Census and Survey Processing System) is a public-domain MS Windows software package for entering, editing, tabulating and mapping census and survey data. For those interested in a facility to capture and record new data, CsPro offers a simple interface to support this task. Support for survey Cross Tabulation is available but limited in scope. The other notable feature is the Mapping capability and viewer. Note, this package is not a complete statistical processing option nor an alternative to a GIS solution. Export of selected data/variables as ascii delimited files is available for input to other packages, such as SAS, SPSS.

April 24, 2007

NEOS: AMPL and GAMS server

GAMS (General Algebraic Modeling System) is a high level mathematical and optimization programming system. It was originally developed by the World Bank and has since been further developed by GAMS Development Corp. Problems that can be cast as optimization problems can find a home in GAMS. Dozens of optimization solvers are available for specialized large scale problems. Check the Model Library for subject specific examples. AMPL is similar to GAMS and newer in its development. Both offer various licensing options in addition to demo or student licenses. NEOS is a free Dept. of Energy computational server hosting GAMS and AMPL code execution. Several interface options are available for submitting codes. The easiest is the web interface to a particular solver of interest. These interfaces are not a substitute for a license, since there are some restrictions, but may be enough for your problem setting.

March 28, 2007

StocNet - social network analysis

Stocnet is free software designed to address aspects of modeling networks. It offers five different statistical (stochastic) methods to estimate networks, calculates some common descriptive network statistics, offers some data transformation and/or graph selection capabilities, and lets one explore network simulation possibilities. The five models include: p* models (ERGMs), blockmodeling, p2 models, ultrametric methods for clustering and ZO methods for undirected graphs/networks. Relatedly, the homepage of one of the authors of Stocnet, Tom Snijders, contains useful information about Social Network Analysis.

March 27, 2007

CARMA Resources

Tufts University Information Technology (UIT) Academic Technology (AT) group recently acquired a Tufts subscription to the Center for the Advancement of Research Methods and Analysis (CARMA) website. CARMA provides video lectures addressing statistical research methodologies widely used in the Social Sciences. The presentations are designed to be tutorial in nature and self-contained. Many topics are presented at the upper undergrad and graduate school level. Presenters are CARMA Fellows from universities throughout the U.S. Most streaming lectures are 60-90 minutes in duration and include downloadable PowerPoint slides. Lecture presentations have included topics such as Limited Dependent Variable regression, Structural Equations Models, Meta-Analysis, NonResponse in Surveys, Robust Regression, Item Response Theory, Latent Growth models, Hierarchical Modeling and many more. A schedule for Spring 2007 Webcasts is available on their website along with archived lectures. Tufts faculty and students are required to register on the CARMA site with a Tufts email address and to obtain a password for access.

March 26, 2007

Gllamm: Generalized Linear Latent & Mixed Models

gllamm is a user contributed STATA program that handles a very wide variety of multilevel latent and mixed variable models. There are three components to gllamm: estimation tasks (gllamm), post-estimation prediction tasks (gllapred) and simulation (gllasim). Why would one care? In some settings, having a unified treatment of estimation for many seemingly unrelated models can help one gain insights into applications and estimation inter-relationships. For example, the following models are all special cases: GLMMs, Multilevel Regressions, Factor models, Item Response, SEM models and Latent Class models. If you have access to STATA, follow the instructions to download and install gllamm.

February 9, 2007

Extreme value random variables

Most major stats packages offer some modeling and inference capability for extreme value random variables. Typically one finds such functionality within Survival Analysis types of routines. For example, SAS offers the Proc LIFEREG procedure, which fits parametric models to failure time data that can be right-, left-, or interval-censored for a variety of extreme value distributions. Additional related functionality can be found in Proc PHREG, Proc LIFETEST and Proc TPHREG. Another useful source for additional extreme value functionality not found in SAS or SPSS routines is R. The contributed packages addressing this topic include: evd, evdbayes, evir, extRemes and ismev. For example, the evd package offers simulation, distribution, quantile and density functions for univariate and multivariate parametric extreme value distributions, and provides fitting functions which calculate maximum likelihood estimates for univariate and bivariate maxima models, and for univariate and bivariate threshold models. And the evdbayes package offers functions for the Bayesian analysis of extreme value models, using MCMC methods. The remaining listed R packages address topics such as: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. .....now that is extreme.....
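
As a taste of what the distribution and quantile functions in such packages compute, here is a Python sketch of the generalized extreme value (GEV) distribution function, its quantile function and the return level that follows from it. The function names are my own and this is not the interface of evd or any other package listed above; parameters are location mu, scale sigma and shape xi, with xi = 0 giving the Gumbel case.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function F(x) with location mu, scale sigma, shape xi."""
    if xi == 0:
        # Gumbel limit of the GEV family.
        return math.exp(-math.exp(-(x - mu) / sigma))
    t = 1 + xi * (x - mu) / sigma
    if t <= 0:
        # Outside the support: below the lower endpoint when xi > 0,
        # above the upper endpoint when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1 / xi))


def gev_quantile(p, mu, sigma, xi):
    """Inverse of gev_cdf for 0 < p < 1."""
    if xi == 0:
        return mu - sigma * math.log(-math.log(p))
    return mu + sigma / xi * ((-math.log(p)) ** (-xi) - 1)


def return_level(period, mu, sigma, xi):
    """Level exceeded on average once every `period` blocks (e.g. years)."""
    return gev_quantile(1 - 1 / period, mu, sigma, xi)
```

If annual maxima follow a GEV with fitted parameters in hand, return_level(100, mu, sigma, xi) is the familiar "100-year" level. The fitting itself (maximum likelihood or MCMC) is what the R packages above provide.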

January 31, 2007

DAWeb - Decision Analysis Society

This post is a bit of a digression from previous specific software tool discussions, but interesting nonetheless since so many people don't really think of Decision Support and Analysis as an area of study. Decision Analysis is a broad area of study. The fields of Statistics, Economics, Operations Research and Psychology(to name a few) have contributed many facets of foundational understanding and contributions to real world complex application settings. DAWeb is the web site of the Decision Analysis Society. With regard to software, the various links offer numerous options for those interested in specialized offerings.

January 17, 2007

Norm and Missing Data options

During the 1980s Rubin, Little and others established the statistical foundations of Missing Data problems. A Bayesian statistical justification for Multiple Imputation methods provided a principled approach to "filling in" missing data and pooling estimates across solutions based on completed data. NORM by Joseph Schafer is a late 1990s, easy to use, Windows based program that implements the methods of Rubin and Little. A major benefit of NORM is ease of use and the author's excellent commentary concerning guidance and theoretical contributions. One downside is that for some workflow styles the interface becomes a burden. In addition, the educational benefit of NORM is not to be overlooked. On the other hand, NORM, SAS and STATA have incorporated missing data routines that are better integrated with their other statistical models/methods. This aspect reduces the burden of use in a more general data analysis setting.
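
The pooling step is simple enough to show directly. The Python sketch below applies the standard combining rules (usually called Rubin's rules) to a point estimate and its variance obtained from each of m completed data sets. The function name is mine and this is not NORM's code; it only covers the scalar case.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates of a scalar quantity.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total, math.sqrt(total)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; analysing a single filled-in data set as if it were complete drops that term and understates the standard error.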

January 4, 2007

DEW - Experimental Design program

DEW is a web based program to help plan Design of Experiments (DOE) designs. Currently block designs, general factorial designs, response surface designs, and more are available. Some output is available as cut-n-paste tables; other options include R code or GenStat sample code. There are many programs devoted to DOE, but DEW is easy to use and may meet some of your needs. On the other hand, if you have access to SAS, several DOE programs are available. The SAS/STAT product has the routine Proc Plan and the SAS/QC product offers Proc Factex and Proc Optex. In addition, there is a SAS GUI interface to this functionality called ADX. Collectively these offer additional features not found in DEW.
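
For the simplest case, a general full factorial design, the run list is just the cross product of the factor levels placed in random order. The short Python sketch below shows this; it is not output from DEW or SAS, and the function names and the factor names in the example are hypothetical.

```python
import random
from itertools import product


def full_factorial(factors):
    """All combinations of factor levels.

    factors: dict mapping factor name -> list of levels
    Returns a list of runs, each a dict of factor name -> level.
    """
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


def randomized_run_order(design, seed=0):
    """Copy of the design in random run order (the seed is fixed for repeatability)."""
    rng = random.Random(seed)
    runs = list(design)
    rng.shuffle(runs)
    return runs
```

For example, full_factorial({"temperature": [150, 180], "catalyst": ["A", "B", "C"]}) yields the six runs of a 2 x 3 design. Blocking, fractional designs and optimal designs are where the dedicated tools above earn their keep.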

January 3, 2007

MNP - Multinomial Probit Regression models

MNP is a useful R program for modeling discrete choices, such as choosing among a finite number of alternatives. What makes this an interesting alternative to software such as Stata or Limdep is that model parameters are estimated via Bayesian Markov chain Monte Carlo (MCMC) methods. Covariates are allowed and control over MCMC tuning is provided. Predictions under a model are available via the posterior predictive distribution.

Copula - modeling bivariate structural dependence

The word 'copula' originates from the Latin noun for a "link or tie" that connects two different things. Over the last decade or so, Copulas have found a niche in Economics and Finance for risk modeling of complex bivariate relationships. More broadly, these models can address structural dependencies in joint distributions that are rather surprising and useful. Matlab has a nice tool to explore these matters. Check out: http://www.mathworks.com/products/demos/statistics/copulademo.html. Alternatively, more specialized programs to deal with multivariate copulas and maximum likelihood estimation can be found in the R packages: copula, fgac, mlCopulaSelection, and msgcop.
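
To show how a copula separates the dependence structure from the marginal distributions, here is a minimal Python sketch that draws from a bivariate Gaussian copula and then applies an exponential marginal. It is not based on the Matlab demo or the R packages above; the function names are my own, and the Gaussian copula is just one convenient choice among many.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_sample(n, rho, seed=0):
    """Draw n pairs (u, v) with uniform margins and Gaussian-copula dependence."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n):
        # Correlated standard normals with correlation rho.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # The normal distribution function maps each to a uniform on (0, 1).
        pairs.append((phi(z1), phi(z2)))
    return pairs


def to_exponential(u, rate):
    """Inverse distribution function of an exponential with the given rate."""
    return -math.log1p(-u) / rate
```

Passing each uniform through any inverse distribution function, such as to_exponential, gives variables with whatever margins one likes while the dependence set by rho is retained. That is the sense in which the copula "links" the margins.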

December 28, 2006

Mplus - where latent ideas matter

About 30 years or so ago a modeling framework known as Structural Equation Models (SEM) became known for its ability to generalize and model observable and unobservable (latent) variables and to include specifications about multiple sources of variation. Early on, Mplus was known for its treatment of models involving latent variables. Over the years Mplus has broadened its scope of features to more fully address mixed variable settings, advanced simulation options, censoring/survival models, nonlinear growth and MultiLevel models. Some of the functionality in Mplus can be found in SPSS's Amos product and SAS's Proc Calis feature, for example. Complex models such as these present unique challenges to the inference process. Both Mplus and Amos address this issue with optional bootstrapping of either residuals or observations, whereas R's sem package will bootstrap observations to estimate parameter standard errors. Additional information can be found at: http://www.statmodel.com/features.shtml
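
Bootstrapping observations to get a standard error is easy to sketch outside of any SEM package. The Python function below resamples cases with replacement and reports the standard deviation of the recomputed statistic. It is a generic illustration with a hypothetical function name, not the routine used by Mplus, Amos or R's sem package; in an SEM setting the statistic would be a refit of the whole model.

```python
import random
from statistics import stdev


def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Nonparametric bootstrap standard error of statistic(data).

    data: list of observations (cases)
    statistic: function mapping a list of observations to a number
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample n cases with replacement and recompute the statistic.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return stdev(replicates)
```

Calling bootstrap_se(values, statistics.mean) should land close to the textbook standard error of the mean, which makes a handy sanity check before trusting the same machinery on a statistic with no closed-form standard error.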

JMP - SAS for the Mac no longer...

JMP used to be a SAS Institute data analysis product for the Apple platform. This allowed SAS to leverage its vast array of statistical software available across many other platforms. Over the years the vendor added Windows/Intel machines. Now a Linux version is available. The latest version has grown to include many options, thus making it a more mainstream statistical package. Besides the rock solid underlying routines provided by SAS, the GUI is perhaps its real strength, making it a productive environment for interactive use. A free 30-day full feature version is now available, check: http://www.jmp.com

December 21, 2006

NAG Statistical Add-ins for Excel

NAG has been well known for years for providing numerical subroutines for scientific computing tasks. Statistical and visualization software components are also available. Recently an Excel add-in option providing 76 statistical functions has become available for those who mostly use Excel as a statistical computing alternative to a larger statistics-only package. Given NAG's excellent record for reliability and accuracy, one might consider this as a superior alternative to Microsoft's add-in statistics option. Additional info at: http://www.nag.com/stats/ae_soft.asp

December 15, 2006

TISEAN - dealing with chaotic systems

The TISEAN software package is a collection of command line utilities addressing methods of nonlinear time series analysis. These are based on the paradigm of deterministic chaos. A variety of algorithms for data representation, prediction, noise reduction, dimension and Lyapunov estimation, and nonlinearity testing are discussed with particular emphasis on issues of implementation and choice of parameters. Source code in C and Fortran is publicly available. Support for GnuPlot is provided. TISEAN can be found at: http://www.mpipks-dresden.mpg.de/~tisean/TISEAN_2.1/index.html

StatXact - NonParametric Inference

StatXact software is a suite of statistical tools that provide exact inference in the nonparametric setting. StatXact has a large number of statistical procedures addressing one, two or K-sample problems, contingency tables and measures of association, and it addresses stratified settings as well. These tools provide assurance about p-values in small sample, unbalanced or missing data settings. A related product, LogXact, is their exact logistic regression tool for similar situations. Added recently are options for Penalized Maximum Likelihood Estimation and methods for dealing with missing categorical covariates in GLM settings using Logit, Probit, CLogLog, Poisson, and Normal links. For additional info check out: http://www.cytel.com

December 12, 2006

Change-Point Analyzer

Change point problems can be found in many areas of science and engineering. The solutions vary according to the modeling setting, and a search of the literature will reveal the large scope of solutions and settings. A shareware option addressing the narrower areas of sequential processes, time series and control charts is Change-Point Analyzer (CPA). CPA analyzes time ordered data to determine whether a change has taken place. It detects multiple changes and provides both confidence levels and confidence intervals for each change. Check out the 30 day demo at: http://www.variation.com/cpa/
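As a rough illustration of how a change in time ordered data can be located and given a confidence level, here is a minimal Python sketch that uses cumulative sums (CUSUM) and random reshuffling. This is one common recipe and is offered only as an assumption about how such tools can work; the post does not say which algorithm CPA uses. The function names and the data are made up for illustration.

```python
import random
import statistics


def cusum(series):
    """Cumulative sums of deviations from the overall mean."""
    mean = statistics.fmean(series)
    sums, total = [], 0.0
    for value in series:
        total += value - mean
        sums.append(total)
    return sums


def detect_change(series, n_shuffles=1000, seed=0):
    """Return (index of last point before the estimated change, confidence level)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sums = cusum(series)
    # The change is estimated where the CUSUM is farthest from zero.
    last_before = max(range(len(sums)), key=lambda i: abs(sums[i]))
    observed_range = max(sums) - min(sums)
    # Shuffling destroys any time ordering; count how often the shuffled
    # CUSUM range stays below the observed one.
    shuffled = list(series)
    below = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        s = cusum(shuffled)
        if max(s) - min(s) < observed_range:
            below += 1
    return last_before, below / n_shuffles


# Made-up series: level near 10 for six points, then near 12.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 12.1, 11.8, 12.3, 12.0, 11.9, 12.2]
index, confidence = detect_change(data)
print(f"last point before change: {index}, confidence: {confidence:.3f}")
```

This sketch handles a single shift in the mean only. Detecting multiple changes would typically mean splitting the series at the estimated change and repeating the procedure on each piece.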

WiSP - abundance modeling for biologists

WiSP is an R library of functions designed as a teaching tool to illustrate methods used to estimate the abundance of closed animal populations. It enables users to generate animal populations having realistically complex spatial and individual characteristics, to generate survey designs for a variety of survey techniques, to survey the populations and to estimate abundance. WiSP can be found at: http://www.ruwpa.st-and.ac.uk/estimating.abundance/WiSP/

December 11, 2006

IVEware

Missing data and variance estimation are standing problems facing most large scale complex surveys. IVEware was developed by the Survey Methodology Program at the University of Michigan's Survey Research Center, Institute for Social Research, and is available to researchers without cost. A SAS interfacing version (requiring SAS) and stand-alone Windows and Linux versions are available. Additional information is available at: http://www.isr.umich.edu/src/smp/ive/

XploRe

XploRe is a combination of classical and modern statistical procedures, in conjunction with sophisticated, interactive graphics. XploRe is the basis for statistical analysis, research, and teaching. Its purpose lies in the exploration and analysis of data, as well as in the development of new techniques. In addition, XploRe is a high level object-oriented programming language. XploRe is a complete statistical programming package, including a great variety of methods such as: generalized linear models and generalized partial linear models, nonparametric methods such as kernel estimation and smoothing, spline smoothing, single index models, generalized additive models, financial option pricing, stock simulation, nonlinear time series analysis, and modern regression techniques with wavelets and neural networks. Both commercial and free academic versions are available. Additional information is available at: http://www.xplore-stat.de/index_js.html

Libra - a helpful sign for robust estimation

High dimensional data presents potential problems for the standard modeling and estimation tasks commonly confronted by the data analyst. Most methods in most statistical packages do not address these issues. Robust estimation theory over the last 40 years has changed the landscape of ideas around what constitutes good practice and procedure. The goal of robust statistics is to develop data analytical methods that are resistant to outlying observations, conditional on the model at hand and for a specified influence function. Such methods are able to discriminate outliers from model-consistent data. LIBRA is an interesting collection of free Matlab programs designed for this very task. Further details can be found at: www.wis.kuleuven.ac.be/stat/robust.html

Octave - or is it Matlab? ...well maybe not, yet.

GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language for settings where long running programs are required. It is useful to think of Octave as a free alternative to Matlab. The community of users is large and productive, resulting in many freely available programs based on Octave. Visit the homepage at: http://www.gnu.org/software/octave/

WesVar - Replication-based approach to Surveys

Complex surveys often require analysis techniques based on what are known as "Design Based" methods. This is a special branch of statistics concerned with finite populations and complex survey designs, together with supporting estimation methods. Tools for this area can be found, to varying degrees, in commercial packages such as SUDAAN, SAS, SPSS, Stata and others. WesVar overlaps these options and offers jackknife and balanced repeated replication (BRR) methods to estimate variances of survey estimates. These methods correctly account for the effects of multistage complex survey designs with stratified and unequal selection probabilities. For more information see: http://www.westat.com/westat/wesvar/about/index.html
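To give a feel for the replication idea behind the jackknife, here is a minimal Python sketch of the delete-one jackknife variance estimate. It assumes a simple random sample with equal weights, which is a deliberate simplification: design-based software such as WesVar works with replicate weights that reflect strata and sampling units, and none of that is shown here. The function name and data are made up for illustration.

```python
import statistics


def jackknife_variance(data, estimator):
    """Delete-one jackknife variance of `estimator` for a simple random sample."""
    n = len(data)
    # Recompute the estimate n times, leaving out one observation each time.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(leave_one_out)
    # Scale the spread of the replicate estimates by (n - 1) / n.
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)


# Made-up illustrative data.
sample = [4.0, 7.5, 6.1, 5.2, 9.3, 3.8, 6.7, 8.0]
var_mean = jackknife_variance(sample, statistics.fmean)
print(f"jackknife variance of the mean: {var_mean:.4f}")
```

For the sample mean this reproduces the familiar s^2 / n, which makes a handy sanity check. The appeal of the method is that the same loop works for estimators with no simple variance formula.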

December 7, 2006

SuperLU

SuperLU contains a set of subroutines to solve a sparse linear system A*X=B. This is often at the heart of many research computing tasks in science, engineering and statistical software. It uses Gaussian elimination with partial pivoting (GEPP). The columns of A may be preordered before factorization; the preordering for sparsity is completely separate from the factorization. SuperLU is implemented in ANSI C, and must be compiled with standard ANSI C compilers. It provides functionality for both real and complex matrices, in both single and double precision. In addition, a Matlab MEX interface option is available for access from within Matlab. Additional info may be found at: http://crd.lbl.gov/~xiaoye/SuperLU/
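For readers unfamiliar with GEPP, the following is a minimal Python sketch of Gaussian elimination with partial pivoting on a small dense system. It is not SuperLU's code or API: it ignores sparsity, column preordering and everything else that makes SuperLU efficient, and the function name is made up for illustration.

```python
def solve_gepp(a, b):
    """Solve a dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


# The zero in the top-left corner forces a row swap on the first step.
solution = solve_gepp([[0, 2, 1], [1, 1, 0], [2, 0, 3]], [7, 3, 11])
print(solution)
```

The row swap is the point of the example: without pivoting, the first elimination step would divide by zero.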

CAGED: cluster analysis of gene expression dynamics

CAGED is a unique Bayesian statistical tool for gene expression profiles that uses a time series approach to clustering. Markov models are used for within-sequence representation, and similarity measures, such as entropy-based distances, are available for clustering between gene sequences. Gene clusters are chosen on the basis of highest marginal likelihood. CAGED is freely available after registration. For more info: http://genomethods.org/caged/

December 5, 2006

MCMCGLMM - C code for spatial GLMs

MCMCGLMM is a C based program for fitting Generalized Linear Models via Markov chain Monte Carlo sampling. Powered-Exponential and Matern spatial covariance functions are available to capture spatial effects, and right and left censoring is available for Gaussian and Binomial-logit models. Active development of this software has stopped, and further development has shifted to R as geoRglm. There is still some functionality in the C code not yet transferred to geoRglm. All things being equal, and for large datasets, the compiled version will execute much faster than geoRglm. In any case, the source code is available for those wishing to extend features. Additional information can be found at: http://www.math.aau.dk/~olefc/Programs/mcmcglmm/mcmcglmm.html

Morgan - Genetic Analysis modeling tool

Morgan is a set of statistical tools for genetic Pedigree Analysis on observed data with possible epidemiological attributes. Utilities are available for addressing issues about pedigree structure, kinship and inbreeding coefficients, Monte Carlo and MCMC techniques for simulation of marker and trait data, estimating conditional gene ibd probabilities, LOD scores, parameter estimation and Polygenic Modeling of quantitative traits by the EM algorithm. Additional information may be found at: http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml

Flexible Bayesian Modeling - FBM

Radford Neal has contributed much over the years to Bayesian regression and classification theory and the areas of neural networks and machine learning. His FBM C based software routines provide modern methods in these areas and more. This software supports Bayesian regression and classification models based on neural networks and Gaussian processes, and Bayesian density estimation and clustering using mixture models and Dirichlet diffusion trees. It also supports a variety of Markov chain sampling methods, which may be applied to distributions specified by simple formulas, including simple Bayesian models defined by formulas for the prior and likelihood. For additional information check: http://www.cs.toronto.edu/~radford/fbm.software.html

December 4, 2006

TimeSearcher 1 & 2 - TimeSeries Viewers

The University of Maryland Computer Science Dept. has developed two free graphical exploration tools for discovering structure in multiple time series. Linkage, zooming, timebox queries, leaders, laggers and other capabilities are among the features used in exploring structure in series. For more information: http://www.cs.umd.edu/hcil/timesearcher/

SatScan - spatial clusters

SatScan is a free software tool that analyzes spatial, temporal and space-time data using the spatial, temporal, or space-time scan statistics. The software may be useful in any area in which clustering in time and/or space needs to be identified. Circular and elliptical scan windows are provided, as well as a user option to create elliptical window composites/mixtures. Likelihood ratio scan statistics are computed for a variety of statistical models, and options for covariate adjustments are available. Additional info may be found at: http://www.satscan.org/

December 1, 2006

SABRE - Analysis of Binary Recurrent Events

SABRE is a program created by the Center for Applied Statistics, Lancaster University, for the statistical analysis of multi-process random effect bivariate and trivariate response data. These responses can take the form of binary, ordinal, count and linear recurrent events in clustered or longitudinal survey sampling settings. The code is written in Fortran 90. Both serial and parallel versions are freely available. The parallel version may be of interest to those with large data sets. You may find out more at: http://www.lancs.ac.uk/staff/cpajp/sabre/index.html

BayesX - Bayesian semiparametric and GAM modeling

BayesX regression tools rely on Markov Chain Monte Carlo simulation techniques and restricted maximum likelihood (REML) estimation. These techniques are used in support of mixed models, semiparametric regressions and survival models with structured additive predictors (STAR). STAR models cover a number of well known model classes as special cases, including generalized additive models (GAM), generalized additive mixed models, geoadditive models, varying coefficient models, and geographically weighted regressions. These methods are useful in non-spatial and spatial settings. Covariate effects within a GAM are specified using P-splines. Additional info can be found at: http://www.stat.uni-muenchen.de/~bayesx/bayesx.html

November 30, 2006

Java tools and libraries

JScience tools are free Java routines developed for the broad scientific community interested in coding scientific applications. Core modules addressing areas such as mathematics, physics, neural networks, biology and others are in active development. Details on what is available and how to register can be found at: JScience.

SOCR educational stats software

Educational statistical software has been an area for tinkering for some time. All sorts of approaches have been taken to explore what works and what doesn't. SOCR is a freely available web resource for statistical educational purposes. The electronic online Journal of Statistical Software has an article discussing SOCR.

November 28, 2006

Regression Methods via ARC

ARC is a framework for the exploration and graphical display of regression model structure and diagnostics. Focus is placed on understanding the conditional mean and variance functions, model structural dimension, nonlinearity, curvature, smoothing, transformation and model assessment. The user interface was designed to allow interactive choice during all phases of use. Graphical regression, brushing and slicing allow for additional insights related to model building. In addition, extensions of these topics to the Generalized Linear Model framework allow for a larger class of models, such as binomial, logistic, Poisson and gamma families.

BootStrapping options

Bootstrap options with tight integration within some statistical methods have become available in recent years in packages such as SAS, Stata, SPSS, S-PLUS and SPSS/Amos. The degree of integration and ease of use vary greatly. Any package that offers sampling with replacement allows one, in principle, to bootstrap an estimator of choice. Why bootstrap? One does so to obtain better approximations to the sampling distributions of estimators. Bootstrap methods potentially offer insights into inference matters that might be difficult or impossible to resolve otherwise. Both small and large sample settings can present complicated data configurations for estimation tasks such as parameter estimates or functions of one or more parameters. For example, in settings such as complex surveys, bootstrap methods have contributed to challenging estimation and inference tasks. Users of R will find several contributed packages for bootstrap methods. Packages boot, bootstrap, pvclust, rqmcmb2, scaleboot, simpleboot, and Hmisc offer standard and advanced options not found in some of the above commercial packages. Much has been written in the last 25 years about the bootstrap. Two useful references to consider are: Efron, B. & Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman and Hall; and Davison, A.C. & Hinkley, D.V. (1997), Bootstrap Methods and their Application, Cambridge Univ. Press.
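The "sample with replacement" idea described above can be sketched in a few lines of Python. This is a minimal illustration of the basic nonparametric bootstrap for a standard error, not the method of any package named above; the function name and data are made up, and a fixed seed is assumed only to make the run repeatable.

```python
import random
import statistics


def bootstrap_se(data, estimator, n_resamples=2000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # The spread of the replicates approximates the sampling variability.
    return statistics.stdev(replicates)


# Made-up illustrative data, not from any study.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]
se_median = bootstrap_se(sample, statistics.median)
print(f"bootstrap SE of the median: {se_median:.3f}")
```

The median is used because it has no simple closed-form standard error, which is the kind of case where the bootstrap earns its keep. Any function of the sample can be passed in as the estimator.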

November 27, 2006

Spatial point process modeling via spatstat

Spatial point pattern data are common across many areas of research. Software for extensive modeling is sparse and spread out across many disciplines. spatstat is a unified collection of tools developed from a modern perspective on spatial statistics. spatstat is a contributed R package and, like many such packages, it provides tools for exploratory data analysis, point process specific graphical displays, and maximum pseudolikelihood model-fitting methods and diagnostics. Model formulation via Gibbs point processes allows one to address homogeneous and inhomogeneous Poisson, Strauss (hard and soft), Cox processes and others. Covariates and multitype point patterns (groups) can be included. The focus is on the definition and formulation of the conditional intensity function depending upon location, trend and interaction. Standard summary functions, including the empty space function F and the related functions G, K and J, are available along with their multitype versions.

November 21, 2006

Visualization with XGobi, XGvis & GGobi

High dimensional data presents many problems related to the tasks of viewing and navigating. One of the first stats oriented software packages to address these issues was AT&T's interactive visualization system, XGobi http://www.research.att.com/areas/stat/xgobi/. XGobi is not a stats package per se. Instead XGobi provides various 1D, 2D and 3D displays that use linkage (brushing) between displays, data IDs and various projection methods such as Grand Tours and Projection Pursuit. XGvis is an interactive MultiDimensional Scaling (MDS) package for proximity data as well as graphical networks. Note that the AT&T URL is a historical reference; current and new development of XGobi continues under the name GGobi, http://www.ggobi.org/.

November 20, 2006

Network Graph modeling resources in R

Graphical networks are those that can be conceptualized as nodes connected by one or more links. Links may be directed or not. Nodes can represent many things, such as concepts, people, tasks, relationships, etc. Some are referred to as Social Networks, Concept Maps or Directed Graphs. In many analyses, the modeling of the node linkage structure is of interest, conditional on the graph. Visualization and descriptive summary measures of network graphs are also required. There are several R based http://www.r-project.org/ modeling packages available to address simple and complex model structures, such as logistic random effects, latent space clusters, linear exponential random network models and many more. Two R packages in particular specialize in this area: statnet and latentnet. StatNet http://csde.washington.edu/statnet/ can handle relatively large networks of about 3,000 nodes and provides tools for both model estimation and model-based network simulation. Latentnet is similar but provides access to latent position and cluster model structures. If, on the other hand, your task is to discover what the graph is, conditional on observed node-specific data, then consider some of the methods available in the Weka http://www.cs.waikato.ac.nz/ml/weka/ package addressing Bayes Net classification methods.

November 17, 2006

Power Analysis - with Piface

Often in the planning stages researchers will need to consider questions about effect and sample sizes needed to support their project and estimation/modeling tasks. Many funding agencies will require justification of sample size planning with Statistical Power methods. Careful attention to such matters is often not an easy task. Good software is necessary but not sufficient. Asking the right discipline specific questions concerning useful effect sizes is just as important. Various software solutions abound on the internet addressing Power Analysis. Piface is a useful no cost solution to many Power Analysis problem settings. Piface and useful commentary can be found at: http://www.stat.uiowa.edu/~rlenth/Power/
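As a small worked example of the kind of calculation involved, here is a Python sketch of the textbook normal-approximation sample size for comparing two group means with a two-sided test. It is not Piface's method. It assumes equal group sizes, a known common standard deviation, and an effect size expressed in standard deviation units; the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    effect_size is the difference in means divided by the common
    standard deviation (an assumption of this sketch).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)  # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


for d in (0.2, 0.5, 1.0):
    print(f"effect size {d}: about {n_per_group(d)} per group")
```

Halving the effect size roughly quadruples the required sample size, which is why settling on a scientifically meaningful effect size matters more than the choice of software. Methods based on the t distribution will give slightly larger numbers.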

November 16, 2006

Data Mining the Weka way....

There is no one tool that is considered superior for purposes of Data Mining. Data Mining means different things to different disciplines and, as a result, many solutions to different kinds of problems exist. A simple working definition of Data Mining is the use of various tools to uncover structure in large amounts (tens of millions to billions of records) of high dimensional data (100s, 1000s or more variables) obtained as a consequence of natural or human systems under interaction. The explosion of data storage and acquisition over the last 30 years has created datasets from all areas of human investigation. The potential and incentive for understanding these structures present research and business arbitrage opportunities. Weka is a collection of Machine Learning algorithms written in Java. An interactive GUI is provided, as well as a command line invocation capability for running multiple jobs. The tools offered in the base version of Weka are extensive. Data management, database connectivity, clustering, visualization, network modeling, prediction tools and validation methods are among its many features. Weka is available at http://www.cs.waikato.ac.nz/ml/weka

November 15, 2006

Bayesian Statistics - Bugs PkBugs and GeoBugs

Bayesian Statistics has evolved over the last 30 years or so with explosive growth and wide reaching theoretical contributions. Widespread adoption by applied researchers has been slow due to the lack of software, the computational complexity, and the model formulations that real world problems present. The BUGS software project has taken a substantial step toward overcoming these obstacles for small and moderate sized problems. The BUGS (Bayesian inference Using Gibbs Sampling) project is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. GeoBUGS 1.2 is an extension for spatial analysis and PKBUGS is for pharmacokinetic modelling. For additional info see: http://www.mrcbsu.cam.ac.uk/bgs/welcome.shtml

Matlab's NonLinear regression tool - nlintool

Matlab offers a variety of Optimization functions in the Optimization and Statistics Toolboxes. One useful application for students is the Gui interface to nlinfit, called nlintool. This interactive graphical tool can be used for nonlinear least squares regression fitting and prediction for functions of one or more variables and parameters. As with all such tools, your mileage will vary depending on needs and data formats. The strength of this tool and interface is the relative ease of use and default outputs. The typical workflow setting is in the support of lab data analysis. You can investigate the nlinfit and nlintool documentation via the DEMOs help browser.

November 8, 2006

Spatial Statistics via spBayes

Spatial statistics is a large collection of tools with different historical developmental settings and results. History aside, one area that has been exploited recently is the class of univariate and multivariate hierarchical point-referenced spatial regression models for Gaussian and non-Gaussian responses. The approach taken in spBayes is through generalized hierarchical random effects models estimated via Markov chain Monte Carlo (MCMC) sampling. Spatial effects are captured via a zero centered multivariate Gaussian process for which a variety of spatial covariance structures can be specified. A new R package http://www.r-project.org called spBayes addresses this area with more success than previous attempts. One advantage of the MCMC approach is the ability to estimate functionals. In particular, a recent entropy based measure called DIC, the Deviance Information Criterion, is available to help assess competing nested or non-nested models conditional on the same set of data.

November 6, 2006

Glimmix - not your ordinary regression

SAS is known as a large and powerful statistical program. SAS recently began offering access to an add-on PROC called GLIMMIX. Many years ago this toolset was developed as a user macro, and it evolved to the point that SAS has turned it into a SAS/STAT PROC. However, it is not yet included in the STAT product under version 9.1. You must register and download it into your installed version of SAS. Also note that the 256 page documentation needs to be downloaded as well. From the SAS docs: The GLIMMIX procedure fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed. These models are known as generalized linear mixed models (GLMM). The GLMMs, like linear mixed models, assume normal (Gaussian) random effects. Conditional on these random effects, data can have any distribution in the exponential family. The exponential family comprises many of the elementary discrete and continuous distributions. The binary, binomial, Poisson, and negative binomial distributions, for example, are discrete members of this family. The normal, beta, gamma, and chi-square distributions are representatives of the continuous distributions in this family. In the absence of random effects, the GLIMMIX procedure fits generalized linear models (fit by the GENMOD procedure). Practically speaking, GLIMMIX is a cross between PROC GENMOD and PROC MIXED functionality. This is a huge plus to researchers needing to deal explicitly with the nature of their data instead of the more likely outcome of approximating a modeling effort with something not quite right for the problem at hand. Examples include choosing a response distribution more closely aligned with your setting, and exploring covariance structures for correlated data and nesting. In addition, thin plate spline modeling is available to address nonparametric smoothing of covariate effects when in fact they may be nonlinear.
With so many additional capabilities, you may view this as a blessing or a curse.

Specialized Statistics Resources

Often researchers will need access to functionality that isn't found in commercial statistics packages. This problem varies quite a bit and is met with specialized solutions by the statistical community. These solutions are often cutting edge, reflecting new statistical research. Most stats packages allow some form of macro authorship. This works to a point and often provides a just in time solution. Well known examples include Matlab's scripting language, SAS IML, GAUSS, Stata, Splus and R. Yet others will seek stand-alone solutions in one form or another. These range from public domain C, C++, Fortran, and Java research subroutines to stand-alone programs with various user interfaces. The goal of this blog is to list references and short descriptions of various solutions that may offer additional insights into your research and the statistical methods, and maybe even save you some time. About a dozen or so topics come to mind and I hope to address them shortly. These posts are not intended as statistical guidance or endorsement. Most problems are best addressed by the advice of an experienced practitioner in the relevant field.