September 14, 2007

LLMM - local linear mixed models

The generality of mixed models (linear or nonlinear) is well known, and they have found extensive use in just about every applied research setting in the Social Sciences, Medicine, Engineering and GIS applications. Variations of these methods have been implemented in most major statistical packages. A recent article in the Journal of Agricultural, Biological, and Environmental Statistics (Sept. 2007) by Heegaard and Nilsen discusses theory and application in a biological spatial setting. LLMM models can be thought of as complementary to GAMM models but with additional user control of the fixed and stochastic structures for purposes of spatial smoothing. An implementation of LLMM as a contributed R package can be found at: http://eecrg.uib.no/personal_pages/LLMM.htm

June 21, 2007

The Bias Project

The Bias Project is an effort to produce research and software addressing Bayesian methods for combining individual and aggregate data sources, as seen in Ecological Inference and Small Area Estimation problems. Many of the software contributions are available as either WinBUGS or R programs. The Bias Project is one of several research efforts at ESRC where unique problems in Social Science research are addressed and explored with modern statistical methods.

June 19, 2007

MIXPREG - mixed effects models

MIXPREG, MIXOR, MIXREG, MIXGSUR and MIXNO are a collection of high quality generalized regression modeling tools that allow for mixed effects and various forms of censoring. Correlated, nested and hierarchical data structures are also addressed. Estimation and inference are based on a full-information maximum likelihood approach rather than a Taylor expansion to linearize the likelihood. The software is free, and Windows XP, Solaris and Mac versions are available.

May 29, 2007

Solar

SOLAR stands for Sequential Oligogenic Linkage Analysis Routines. SOLAR addresses genetic variance components analysis, including linkage analysis, quantitative genetic analysis, and covariate screening. Two basic types of linkage analysis are available: Twopoint and Multipoint. Maximum Likelihood estimation, Monte Carlo simulations and Bayesian Model Averaging are some of the options available to address model formation and screening. SOLAR is available on the Tufts Bioinformatic Server, where larger, computationally intensive jobs may be run.

CsPro - Census.Gov survey program

CSPro (Census and Survey Processing System) is a public-domain MS Windows software package for entering, editing, tabulating and mapping census and survey data. For those interested in a facility to capture and record new data, CSPro offers a simple interface to support this task. Support for survey cross tabulation is available but limited in scope. The other notable feature is the mapping capability and viewer. Note that this package is neither a complete statistical processing option nor an alternative to a GIS solution. Selected data and variables can be exported as ASCII delimited files for input to other packages, such as SAS and SPSS.

April 24, 2007

NEOS: AMPL and GAMS server

GAMS (General Algebraic Modeling System) is a high level mathematical and optimization programming system. It was originally developed by the World Bank and has since been further developed by GAMS Development Corp. Problems that can be cast as optimization problems can find a home in GAMS. Dozens of optimization solvers are available for specialized large scale problems. Check the Model Library for subject specific examples. AMPL is similar to GAMS and newer in its development. Both offer various licensing options in addition to demo or student licenses. NEOS is a free Dept. of Energy computational server hosting GAMS and AMPL code execution. Several interface options are available for submitting codes. The easiest is the web interface to a particular solver of interest. These interfaces are not a substitute for a license, since there are some restrictions, but may be enough for your problem setting.

March 28, 2007

StocNet - social network analysis

Stocnet is free software designed to address aspects of modeling networks. It offers five different statistical (stochastic) methods to estimate networks, calculates some common descriptive network statistics, offers some data transformation and graph selection capabilities, and lets one explore network simulation possibilities. The five models are: p* models (ERGMs), blockmodeling, p2 models, ultrametric methods for clustering and ZO methods for undirected graphs/networks. Relatedly, the homepage of one of the authors of Stocnet, Tom Snijders, contains useful information about Social Network Analysis.

March 27, 2007

CARMA Resources

The Tufts University Information Technology (UIT) Academic Technology (AT) group recently acquired a Tufts subscription to the Center for the Advancement of Research Methods and Analysis (CARMA) website. CARMA provides video lectures addressing statistical research methodologies widely used in the Social Sciences. The presentations are designed to be tutorial in nature and self-contained. Many topics are presented at the upper undergrad and graduate school level. Presenters are CARMA Fellows from universities throughout the U.S. Most streaming lectures are 60-90 minutes in duration and include downloadable PowerPoint slides. Lecture presentations have included topics such as Limited Dependent Variable regression, Structural Equation Models, Meta-Analysis, NonResponse in Surveys, Robust Regression, Item Response Theory, Latent Growth models, Hierarchical Modeling and many more. A schedule for Spring 2007 webcasts is available on their website along with archived lectures. Tufts faculty and students are required to register on the CARMA site with a Tufts email address and to obtain a password for access.

March 26, 2007

Gllamm: Generalized Linear Latent & Mixed Models

gllamm is a user contributed STATA program that handles a very wide variety of multilevel latent and mixed variable models. There are three components to gllamm: estimation tasks (gllamm), post-estimation prediction tasks (gllapred) and simulation (gllasim). Why would one care? In some settings, having a unified treatment of estimation for many seemingly unrelated models can help one gain insights into applications and estimation inter-relationships. For example, the following models are all special cases: GLMMs, Multilevel Regressions, Factor models, Item Response, SEM models and Latent Class models. If you have access to STATA, follow the instructions to download and install gllamm.

February 9, 2007

Extreme value random variables

Most major stats packages offer some modeling and inference capability for extreme value random variables. Typically one finds such functionality within Survival Analysis types of routines. For example, SAS offers Proc LIFEREG, which fits parametric models to failure time data that can be right-, left-, or interval-censored for a variety of extreme value distributions. Additional related functionality can be found in Proc PHREG, Proc LIFETEST and Proc TPHREG. Another useful source of extreme value functionality not found in SAS or SPSS routines is R. Contributed packages addressing this topic include: evd, evdbayes, evir, extRemes and ismev. For example, the evd package offers simulation, distribution, quantile and density functions for univariate and multivariate parametric extreme value distributions, and provides fitting functions which calculate maximum likelihood estimates for univariate and bivariate maxima models, and for univariate and bivariate threshold models. The evdbayes package offers functions for the Bayesian analysis of extreme value models, using MCMC methods. The remaining listed R packages address topics such as: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, and gev/gpd distributions. .....now that is extreme.....

January 31, 2007

DAWeb - Decision Analysis Society

This post is a bit of a digression from previous specific software tool discussions, but interesting nonetheless since so many people don't really think of Decision Support and Analysis as an area of study. Decision Analysis is a broad area of study. The fields of Statistics, Economics, Operations Research and Psychology (to name a few) have contributed both foundational understanding and solutions for complex real world application settings. DAWeb is the web site of the Decision Analysis Society. With regard to software, the various links offer numerous options for those interested in specialized offerings.

January 17, 2007

Norm and Missing Data options

During the 1980s Rubin, Little and others established the statistical foundations of Missing Data problems. A Bayesian statistical justification for Multiple Imputation methods provided a principled approach to "filling in" missing data and pooling estimates across the completed data sets. NORM by Joseph Schafer is an easy to use, Windows based program from the late 1990s that implements the methods of Rubin and Little. A major benefit of NORM is ease of use and the author's excellent commentary concerning guidance and theoretical contributions. One downside is that for some workflow styles the interface becomes a burden. In addition, the educational benefit of NORM is not to be overlooked. On the other hand, NORM, SAS and STATA have incorporated missing data routines that are better integrated with their other statistical models/methods. This aspect reduces the burden of use in a more general data analysis setting.
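The pooling step mentioned above can be sketched in a few lines. This is a minimal illustration of Rubin's combining rules, not NORM's own code; the function name `pool_rubin` and the three estimates and variances are made up for the example.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    w_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 divisor)
    total = w_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total

# Hypothetical results from m = 3 imputed data sets.
pooled_estimate, pooled_variance = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The between-imputation term is what distinguishes this from simply averaging: it carries the extra uncertainty due to the missing values.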

January 4, 2007

DEW - Experimental Design program

DEW is a web based program to help plan Design of Experiments (DOE) designs. Currently block designs, general factorial designs, response surface designs, and more are available. Some output is available as cut-n-paste tables; other options include R code or GenStat sample code. There are many programs devoted to DOE; DEW is easy to use and may meet some of your needs. On the other hand, if you have access to SAS, several DOE programs are available. The SAS/STAT product has the routine Proc Plan, and the SAS/QC product offers Proc Factex and Proc Optex. In addition, there is a SAS GUI interface to this functionality called ADX. Collectively these offer additional features not found in DEW.

January 3, 2007

MNP - Multinomial Probit Regression models

MNP is a useful R program for modeling discrete choices, such as choosing among a finite number of alternatives. What makes this an interesting alternative to software such as Stata or Limdep is that model parameters are estimated via Bayesian Markov chain Monte Carlo (MCMC) methods. Covariates are allowed and control over MCMC tuning is provided. Predictions under a model are available via the posterior predictive distribution.

Copula - modeling bivariate structural dependence

The word 'copula' originates from the Latin noun for a "link or tie" that connects two different things. Over the last decade or so, copulas have found a niche in Economics and Finance for risk modeling of complex bivariate relationships. More broadly, these models can address structural dependencies in joint distributions in ways that are rather surprising and useful. Matlab has a nice tool to explore these matters. Check out: http://www.mathworks.com/products/demos/statistics/copulademo.html. Alternatively, more specialized programs to deal with multivariate copulas and maximum likelihood estimation can be found in the R programs: copula, fgac, mlCopulaSelection, and msgcop.
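To make the "link" idea concrete, here is a minimal sketch of sampling from a bivariate Gaussian copula using only the Python standard library. It is not taken from the Matlab demo or any of the R packages above; the function names, the correlation of 0.8 and the exponential margin are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_sample(n, rho, seed=0):
    """Draw n pairs (u, v) with uniform margins linked by a Gaussian copula."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each coordinate to a uniform(0, 1) margin.
        pairs.append((phi(z1), phi(z2)))
    return pairs

def pearson(xs, ys):
    """Plain Pearson correlation, used here only to check the dependence."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

pairs = gaussian_copula_sample(2000, rho=0.8)
us = [u for u, _ in pairs]
vs = [v for _, v in pairs]
# Any margin can be imposed afterwards through its inverse CDF,
# here an exponential(1) margin on the first coordinate.
exp_margin = [-math.log(1.0 - min(u, 1.0 - 1e-12)) for u in us]
```

The dependence structure lives entirely in the (u, v) pairs, so the margins can be swapped without touching it.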

December 28, 2006

Mplus - where latent ideas matter

About 30 years or so ago a modeling framework known as Structural Equation Models (SEM) became known for its ability to generalize and model observable and unobservable (latent) variables and to include specifications about multiple sources of variation. Early on, Mplus was known for its treatment of models involving latent variables. Over the years Mplus has broadened its scope of features to more fully address mixed variable settings, advanced simulation options, censoring/survival models, nonlinear growth and multilevel models. Some of the functionality in Mplus can be found in SPSS's Amos product and SAS's Proc Calis feature, for example. Complex models such as these present unique challenges to the inference process. Both Mplus and Amos address this issue with optional bootstrapping of either residuals or observations, whereas R's SEM package will bootstrap observations to estimate parameter standard errors. Additional information can be found at: http://www.statmodel.com/features.shtml

JMP - SAS for the Mac no longer...

JMP used to be a SAS Institute data analysis product for the Apple platform. This allowed SAS to leverage its vast array of statistical software available across many other platforms. Over the years the vendor added Windows/Intel machines. Now a Linux version is available. The latest version has grown to include many options, thus making it a more mainstream statistical package. Besides the rock solid underlying routines provided by SAS, the GUI is perhaps its real strength, making it a productive environment for interactive use. A free 30-day full feature version is now available; check: http://www.jmp.com

December 21, 2006

NAG Statistical Add-ins for Excel

NAG has been well known for years for providing numerical subroutines for scientific computing tasks. Statistical and visualization software components are also available. Recently an Excel add-in option providing 76 statistical functions has become available for those who mostly use Excel as a statistical computing alternative to a larger statistics-only package. Given NAG's excellent record for reliability and accuracy, one might consider this a superior alternative to Microsoft's add-in statistics option. Additional info at: http://www.nag.com/stats/ae_soft.asp

December 15, 2006

TISEAN - dealing with chaotic systems

The TISEAN software package is a collection of command line utilities addressing methods of nonlinear time series analysis. These are based on the paradigm of deterministic chaos. A variety of algorithms for data representation, prediction, noise reduction, dimension and Lyapunov estimation, and nonlinearity testing are discussed with particular emphasis on issues of implementation and choice of parameters. Source code in C and Fortran is publicly available. Support for GnuPlot is provided. TISEAN can be found at: http://www.mpipks-dresden.mpg.de/~tisean/TISEAN_2.1/index.html

StatXact - NonParametric Inference

StatXact software is a suite of statistical tools that provide exact inference in the nonparametric setting. StatXact has a large number of statistical procedures addressing one, two or K-sample problems, problems involving contingency tables, and measures of association, and it addresses stratified settings as well. These tools provide assurance of p-values in small sample, unbalanced or missing sampled data settings. A related product, LogXact, is their exact logistic regression tool for similar situations. Added recently are options for Penalized Maximum Likelihood Estimation and methods for dealing with missing categorical covariates in GLM settings using Logit, Probit, CLogLog, Poisson, and Normal links. For additional info check out: http://www.cytel.com

December 12, 2006

Change-Point Analyzer

Change point problems can be found in many areas of science and engineering. The solutions vary according to the modeling setting. A search of the literature will reveal the large scope of solutions and settings. A shareware software option addressing the narrow areas of sequential processes, time series and control charts is Change-Point Analyzer (CPA). CPA analyzes time ordered data to determine whether a change has taken place. It detects multiple changes and provides both confidence levels and confidence intervals for each change. Check out the 30 day demo at: http://www.variation.com/cpa/
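For readers who want a feel for how a single change in the mean can be located, here is a minimal sketch based on the cumulative sum (CUSUM) of deviations from the overall mean, with a reshuffling step to attach a confidence level. It is a generic illustration under those assumptions and is not claimed to be the algorithm CPA itself runs; the function names and the toy series are hypothetical.

```python
import random

def cusum(series):
    """Cumulative sums of deviations from the overall mean, starting at 0."""
    mean = sum(series) / len(series)
    total, path = 0.0, [0.0]
    for x in series:
        total += x - mean
        path.append(total)
    return path

def change_point(series):
    """Number of observations before the estimated change (largest |CUSUM|)."""
    path = cusum(series)
    return max(range(1, len(series)), key=lambda i: abs(path[i]))

def confidence(series, n_shuffles=1000, seed=0):
    """Share of random reorderings whose CUSUM range is below the observed one."""
    def spread(values):
        path = cusum(values)
        return max(path) - min(path)
    rng = random.Random(seed)
    observed = spread(series)
    shuffled = list(series)
    smaller = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if spread(shuffled) < observed:
            smaller += 1
    return smaller / n_shuffles

data = [10.0] * 6 + [14.0] * 6   # toy series with a shift after 6 points
k = change_point(data)
level = confidence(data)
```

If the ordering carries no information, reshuffled series should produce CUSUM ranges as large as the observed one fairly often, so a high `level` suggests a real change.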

WiSP - abundance modeling for biologists

WiSP is an R library of functions designed as a teaching tool to illustrate methods used to estimate the abundance of closed animal populations. It enables users to generate animal populations having realistically complex spatial and individual characteristics, to generate survey designs for a variety of survey techniques, to survey the populations and to estimate abundance. WiSP can be found at: http://www.ruwpa.st-and.ac.uk/estimating.abundance/WiSP/

December 11, 2006

IVEware

Missing data problems and variance estimation in complex surveys are standing problems facing most large scale surveys. IVEware was developed by the Survey Methodology Program at the University of Michigan's Survey Research Center, Institute for Social Research, and is available to researchers without cost. A SAS interfacing version (requiring SAS) and stand alone Windows and Linux versions are available. Additional information is available at: http://www.isr.umich.edu/src/smp/ive/

XploRe

XploRe is a combination of classical and modern statistical procedures, in conjunction with sophisticated, interactive graphics. XploRe is the basis for statistical analysis, research, and teaching. Its purpose lies in the exploration and analysis of data, as well as in the development of new techniques. In addition, XploRe is a high level object-oriented programming language. XploRe is a complete statistical programming package, including a great variety of methods such as: generalized linear models and generalized partial linear models, nonparametric methods such as kernel estimation and smoothing, spline smoothing, single index models, generalized additive models, financial option pricing, stock simulation, nonlinear time series analysis, and modern regression techniques with wavelets and neural networks. Both commercial and free academic versions are available. Additional information is available at: http://www.xplore-stat.de/index_js.html

Libra - a helpful sign for robust estimation

High dimensional data presents potential problems to standard modeling and estimation tasks commonly confronted by the data analyst. Most methods in most statistical packages do not address these issues. Robust estimation theory over the last 40 years has changed the landscape of ideas around what constitutes good practice and procedure. The goal of robust statistics is to develop data analytical methods which are resistant to outlying observations, conditional on the model at hand and for a specified influence function. Such methods are able to discriminate outliers from model consistent data. LIBRA is an interesting collection of free Matlab programs designed for this very task. Further details can be found at: www.wis.kuleuven.ac.be/stat/robust.html

Octave - or is it Matlab? ...well maybe not, yet.

GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language for settings where long running programs are required. It is useful to think of Octave as a free alternative to Matlab. The community of users is large and productive, resulting in many freely available programs based on Octave. Visit the homepage at: http://www.gnu.org/software/octave/

WesVar - Replication-based approach to Surveys

Complex Surveys often require analysis techniques based on what is known as "Design Based" methods. This is a special branch of Statistics concerned with finite populations and complex survey designs with supporting estimation methods. Options to address this area of survey tools can be found, to varying degrees, in commercial packages such as SUDAAN, SAS, SPSS, STATA and others. WesVar overlaps these options and offers jackknife and balanced repeated replication (BRR) methods to estimate variances of survey estimates. These methods correctly account for the effects of multistage complex survey designs with stratified and unequal selection probabilities. For more information see: http://www.westat.com/westat/wesvar/about/index.html
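The replication idea behind the jackknife can be shown with a small sketch. The version below is the plain delete-one jackknife for a simple random sample; it deliberately ignores strata, clusters and weights, which is exactly what a package like WesVar adds on top. The function name and data are illustrative assumptions.

```python
def jackknife_variance(data, stat):
    """Delete-one jackknife variance estimate for the statistic `stat`."""
    n = len(data)
    # Recompute the statistic n times, leaving out one observation each time.
    replicates = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)

def sample_mean(values):
    return sum(values) / len(values)

sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
var_of_mean = jackknife_variance(sample, sample_mean)
```

For the sample mean this reproduces the textbook value s²/n, which makes it a convenient sanity check before applying the same replicate logic to estimators with no closed-form variance.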

December 7, 2006

SuperLU

SuperLU contains a set of subroutines to solve a sparse linear system A*X=B. This is often at the heart of many research computing tasks in science, engineering, and statistical software. It uses Gaussian elimination with partial pivoting (GEPP). The columns of A may be preordered before factorization; the preordering for sparsity is completely separate from the factorization. SuperLU is implemented in ANSI C, and must be compiled with standard ANSI C compilers. It provides functionality for both real and complex matrices, in both single and double precision. In addition, a Matlab MEX interface option is available for access from within Matlab. Additional info may be found at: http://crd.lbl.gov/~xiaoye/SuperLU/
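For those unfamiliar with GEPP, the following is a minimal dense sketch of the technique for a single right-hand side. It is only meant to show the pivoting and elimination steps; it makes no use of sparsity or column preordering and is in no way SuperLU's implementation. The function name and test system are made up.

```python
def solve_gepp(a, b):
    """Solve a x = b by Gaussian elimination with partial (row) pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

# The zero in the top-left corner forces a row swap at the first step.
A = [[0.0, 2.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 3.0]]
rhs = [7.0, 3.0, 11.0]
solution = solve_gepp(A, rhs)
```

Without the row swap the first division would be by zero, which is the simplest illustration of why pivoting matters.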

CAGED: cluster analysis of gene expression dynamics

CAGED is a unique Bayesian statistical tool for gene expression profiles that uses a time series approach to clustering. Markov models are used for within sequence representation, and similarity measures, such as entropy-based distances, are available for between gene sequence clustering purposes. Gene clusters are chosen on the basis of highest marginal likelihoods. CAGED is freely available after registration. For more info: http://genomethods.org/caged/

December 5, 2006

MCMCGLMM - C code for spatial GLMs

MCMCGLMM is a C based program for the fitting of Generalized Linear Models via Markov chain Monte Carlo sampling. Powered-Exponential and Matern spatial covariance functions are available to capture spatial effects, and right and left censoring is available for Gaussian and Binomial-logit models. Active development of this software has stopped, and further development has shifted to R as geoRglm. There is still some functionality in the C code not yet transferred to geoRglm. All things being equal, and for large datasets, the compiled version will execute much faster than geoRglm. In any case, the source code is available for those wishing to extend features. Additional information can be found at: http://www.math.aau.dk/~olefc/Programs/mcmcglmm/mcmcglmm.html

Morgan - Genetic Analysis modeling tool

Morgan is a set of statistical tools for genetic Pedigree Analysis on observed data with possible epidemiological attributes. Utilities are available for addressing issues about pedigree structure, kinship and inbreeding coefficients, Monte Carlo and MCMC techniques for simulation of marker and trait data, estimating conditional gene ibd probabilities, LOD scores, parameter estimation and Polygenic Modeling of quantitative traits by the EM algorithm. Additional information may be found at: http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml

Flexible Bayesian Modeling - FBM

Radford Neal has contributed much over the years to Bayesian regression and classification theory and the areas of neural networks and machine learning. His FBM C based software routines provide modern methods in these areas and more. This software supports Bayesian regression and classification models based on neural networks and Gaussian processes, and Bayesian density estimation and clustering using mixture models and Dirichlet diffusion trees. It also supports a variety of Markov chain sampling methods, which may be applied to distributions specified by simple formulas, including simple Bayesian models defined by formulas for the prior and likelihood. For additional information check: http://www.cs.toronto.edu/~radford/fbm.software.html

December 4, 2006

TimeSearcher 1 & 2 - TimeSeries Viewers

The University of Maryland Computer Science Dept. has developed two free graphical exploration tools for discovery of structure in multiple time series. Linkage, zooming, timebox queries, leaders, laggers and other capabilities are some of the features used in exploring structure in series. For more information: http://www.cs.umd.edu/hcil/timesearcher/

SatScan - spatial clusters

SatScan is a free software tool that analyzes spatial, temporal and space-time data using the spatial, temporal, or space-time scan statistics. The software may be useful in any area for which clustering in time and/or space needs to be identified. Circular and elliptical scan windows are provided, as well as a user option to create elliptical window composites/mixtures. Likelihood ratio scan statistics are computed for a variety of statistical models, and options for covariate adjustments are available. Additional info may be found at: http://www.satscan.org/

December 1, 2006

SABRE - Analysis of Binary Recurrent Events

SABRE is a program created by the Center for Applied Statistics, Lancaster University, for the statistical analysis of multi-process random effect bivariate and trivariate response data. These responses can take the form of binary, ordinal, count and linear recurrent events in the clustered or longitudinal survey sampling settings. The code is written in Fortran 90. Both serial and parallel versions are freely available. The parallel version may be of interest to those with large data sets. You may find out more at: http://www.lancs.ac.uk/staff/cpajp/sabre/index.html

BayesX - Bayesian semiparametric and GAM modeling

BayesX regression tools rely on Markov Chain Monte Carlo simulation techniques and restricted maximum likelihood (REML) estimation. These techniques are used in support of mixed models, semiparametric regressions and survival models with structured additive predictors (STAR). STAR models cover a number of well known model classes as special cases, including generalized additive models (GAM), generalized additive mixed models, geoadditive models, varying coefficient models, and geographically weighted regressions. These methods are useful in non-spatial and spatial settings. Covariate effects within a GAM are specified using P-splines. Additional info can be found at: http://www.stat.uni-muenchen.de/~bayesx/bayesx.html

November 30, 2006

Java tools and libraries

JScience tools are free Java routines developed for the broad scientific community interested in coding scientific applications. Core modules addressing areas such as mathematics, physics, neural networks, biology and others are in active development. Details on what is available and how to register can be found at: JScience.

SOCR educational stats software

Educational statistical software has been an area for tinkering for some time. All sorts of approaches have been taken to explore what works and what doesn't. SOCR is a freely available web resource for statistical educational purposes. The electronic online Journal of Statistical Software has an article discussing SOCR.

November 28, 2006

Regression Methods via ARC

ARC is a framework for the exploration and graphical display of regression model structure and diagnostics. Focus is placed on understanding the conditional mean and variance functions, model structural dimension, nonlinearity, curvature, smoothing, transformation and model assessment. The user interface is unique in that it was designed to allow interactive choice during all phases of use. Graphical regression, brushing and slicing allow for additional insights related to model building. In addition, extensions of these topics to the Generalized Linear Model framework allow for a larger class of models, such as the binomial, logistic, Poisson and gamma families.

BootStrapping options

Bootstrap options with tight integration within some statistical methods have become available in recent years in packages such as SAS, Stata, SPSS, Splus and SPSS/Amos. The degree and ease of use varies greatly. In packages that offer a sample-with-replacement method, one can, in principle, bootstrap an estimator of choice. Why bootstrap? One does so to achieve better approximations to the sampling distributions of estimators. Bootstrap methods potentially offer insights into inference matters that might be difficult or impossible to reconcile otherwise. Small and large sample size settings can present complicated data configurations to estimation tasks such as parameter estimates or functions of one or more parameters. For example, in settings such as complex surveys, bootstrap methods have contributed to challenging survey estimation and inference tasks. For users of R, several contributed packages for bootstrap methods are available. Packages boot, bootstrap, pvclust, rqmcmb2, scaleboot, simpleboot, and Hmisc offer standard and advanced options not found in some of the above commercial packages. Much has been written in the last 25 years about the bootstrap. Two useful references to consider are: Efron, B & Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman and Hall. And: Davison & Hinkley, (1997), Bootstrap Methods and their Applications, Cambridge Univ. Press.
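The "sample with replacement, then recompute" recipe is short enough to sketch directly. The example below builds a simple percentile interval for a median using only the Python standard library; it is an illustration of the basic idea, not of any of the packages named above, and the function name, data and settings are made up.

```python
import random
from statistics import median

def bootstrap_percentile_ci(data, stat, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for `stat` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n observations with replacement and recompute the statistic.
    replicates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

observations = [2, 3, 5, 7, 8, 9, 11, 13, 14, 20]
low, high = bootstrap_percentile_ci(observations, median)
```

Any function of the sample can be passed as `stat`, which is the practical appeal: the same few lines apply to estimators with no convenient standard error formula. More refined intervals (for example BCa) are covered in the two references above.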

November 27, 2006

Spatial point process modeling via spatstat

Spatial point pattern data are common across many areas of research. Software for extensive modeling is sparse and spread out across many disciplines. spatstat is a unified collection of tools developed from a modern perspective on spatial statistics. spatstat is a contributed R package. Like many of these packages, it provides tools for exploratory data analysis, point process specific graphical displays, and maximum pseudolikelihood model-fitting methods and diagnostics. Model formulation via Gibbs point processes allows one to address homogeneous and inhomogeneous Poisson, Strauss (hard and soft), Cox processes and others. Consideration and inclusion of covariates and multitype point patterns (groups) are possible. The focus is on the definition and formulation of the conditional intensity function depending upon location, trend and interaction. Standard summary space functions and multitype versions of the empty space function F and variants G, K, J are available.

November 21, 2006

Visualization with XGobi, XGvis & GGobi

High dimensional data presents many problems related to the tasks of viewing and navigating. One of the first stats oriented software packages to address these issues was AT&T's interactive visualization system, XGobi http://www.research.att.com/areas/stat/xgobi/. XGobi is not a stats package per se. Instead XGobi provides various 1D, 2D and 3D displays in ways that use linkage (brushing) between displays, data IDs and various projection methods such as Grand Tours and Projection Pursuit. XGvis is an interactive MultiDimensional Scaling (MDS) package for proximity data as well as graphical networks. Note that the AT&T URL is a historical reference; current and new development of XGobi continues under the name GGobi, http://www.ggobi.org/.

November 20, 2006

Network Graph modeling resources in R

Graphical networks are those that can be conceptualized as nodes connected by one or more links. Links may be directed or not. Nodes can represent many things, such as concepts, people, tasks, relationships, etc. Some are referred to as Social Networks, Concept Maps or Directed Graphs. In many cases of analysis, the modeling of the node linkage structure is of interest, conditional on the graph. Visualization and descriptive summary measures of network graphs are also required. There are several R based http://www.r-project.org/ modeling packages available to address simple and complex model structures, such as logistic random effects, latent space clusters, linear exponential random network models and many more. Two R packages in particular specialize in this area: statnet and latentnet. StatNet http://csde.washington.edu/statnet/ can handle relatively large networks of about 3,000 nodes and provides tools for both model estimation and model-based network simulation. Latentnet is similar but provides access to latent position and cluster model structures. If, on the other hand, your task is to uncover/discover what the graph is, conditional on observed node specific data, then consider some of the methods available in the Weka http://www.cs.waikato.ac.nz/ml/weka/ package addressing Bayes Net classification methods.

November 17, 2006

Power Analysis - with Piface

Often in the planning stages researchers will need to consider questions about effect and sample sizes needed to support their project and estimation/modeling tasks. Many funding agencies will require justification of sample size planning with Statistical Power methods. Careful attention to such matters is often not an easy task. Good software is necessary but not sufficient. Asking the right discipline specific questions concerning useful effect sizes is just as important. Various software solutions abound on the internet addressing Power Analysis. Piface is a useful no cost solution to many Power Analysis problem settings. Piface and useful commentary can be found at: http://www.stat.uiowa.edu/~rlenth/Power/
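
As a rough illustration of the kind of calculation involved, here is a minimal Python sketch of sample size and power for comparing two group means. It is not how Piface works internally; it assumes a two-sided test, equal group sizes, a standardized effect size (difference in means divided by the common standard deviation) and a normal approximation. The function names n_per_group and achieved_power are made up for this example.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

def achieved_power(effect_size, n, alpha=0.05):
    """Approximate power with n subjects per group. The negligible
    contribution of the opposite rejection tail is ignored."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect_size * (n / 2) ** 0.5 - z_alpha)

# A "medium" standardized effect of 0.5 at alpha = 0.05 and 80% power.
print(n_per_group(0.5))          # 63 per group under the normal approximation
print(achieved_power(0.5, 63))   # just above 0.80
```

A t-based calculation gives a slightly larger answer, and the harder part remains choosing an effect size that is meaningful in your discipline.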

November 16, 2006

Data Mining the Weka way....

There is no one tool that is considered superior for purposes of Data Mining. Data Mining means different things to different disciplines and, as a result, many solutions to different kinds of problems exist. A simple working definition of Data Mining is the use of various tools to uncover structure in large amounts (tens of millions to billions of records) of high-dimensional data (100s, 1000s or more variables) obtained from natural or human systems in interaction. The explosion of data storage and acquisition over the last 30 years has created datasets from all areas of human investigation. The potential and incentive for understanding these structures present research and business arbitrage opportunities. Weka is a collection of Machine Learning algorithms written in Java. An interactive GUI is provided, as well as a command line invocation capability for running multiple jobs. The tools offered in the base version of Weka are extensive. Data management, database connectivity, clustering, visualization, network modeling, prediction tools and validation methods are among its many features. Weka is available at http://www.cs.waikato.ac.nz/ml/weka

November 15, 2006

Bayesian Statistics - Bugs, PkBugs and GeoBugs

Bayesian Statistics has evolved over the last 30 years or so with explosive growth and wide-reaching theoretical contributions. Widespread adoption by applied researchers has been slow due to the lack of software, computational complexity and the model formulations that real-world problems present. The BUGS software project has taken a substantial step toward bridging these gaps for small and moderate-sized problems. The BUGS (Bayesian inference Using Gibbs Sampling) project is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. GeoBUGS 1.2 is an extension for spatial analysis and PKBUGS is for pharmacokinetic modelling. For additional info see: http://www.mrcbsu.cam.ac.uk/bgs/welcome.shtml
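
To give a feel for what Gibbs sampling does, here is a minimal Python sketch for about the simplest case: normal data with unknown mean and precision. This is not BUGS code and not how BUGS is implemented. The model, the vague conjugate priors (mu ~ Normal(mu0, 1/tau0), tau ~ Gamma(a, b) with b a rate), the simulated data and the function name gibbs_normal are all assumptions made for illustration.

```python
import random
from statistics import fmean

def gibbs_normal(y, n_iter=2000, burn_in=500,
                 mu0=0.0, tau0=1e-6, a=0.001, b=0.001, seed=1):
    """Gibbs sampler for y_i ~ Normal(mu, 1/tau), alternating draws from
    the two full conditional distributions. Returns (mu, tau) draws."""
    rng = random.Random(seed)
    n, sum_y = len(y), sum(y)
    mu, tau = fmean(y), 1.0          # starting values
    draws = []
    for i in range(n_iter):
        # mu | tau, y is normal with precision tau0 + n * tau
        prec = tau0 + n * tau
        mean = (tau0 * mu0 + tau * sum_y) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # tau | mu, y is gamma with shape a + n/2 and rate b + SS/2;
        # gammavariate takes a scale parameter, so pass 1 / rate
        rate = b + 0.5 * sum((yi - mu) ** 2 for yi in y)
        tau = rng.gammavariate(a + n / 2, 1.0 / rate)
        if i >= burn_in:
            draws.append((mu, tau))
    return draws

# Simulated data: 200 observations with true mean 5 and sd 2.
data_rng = random.Random(0)
y = [data_rng.gauss(5.0, 2.0) for _ in range(200)]

draws = gibbs_normal(y)
post_mu = fmean(m for m, _ in draws)
post_sd = fmean(t ** -0.5 for _, t in draws)
print(post_mu, post_sd)   # posterior means of mu and of the sd 1/sqrt(tau)
```

BUGS automates this for far more complex models: you describe the model and it works out and samples the conditional distributions for you.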

Matlab's NonLinear regression tool - nlintool

Matlab offers a variety of optimization functions in the Optimization and Statistics Toolboxes. One useful application for students is the GUI front end to nlinfit, called nlintool. This interactive graphical tool can be used for nonlinear least squares regression fitting and prediction for functions of one or more variables and parameters. As with all such tools, your mileage will vary depending on needs and data formats. The strength of this tool and interface is the relative ease of use and default outputs. A typical use is in support of lab data analysis. You can investigate the nlinfit and nlintool documentation via the DEMOs help browser.

November 8, 2006

Spatial Statistics via spBayes

Spatial statistics is a large collection of tools with different historical developmental settings and results. History aside, one area that has been exploited recently is the class of univariate and multivariate hierarchical point-referenced spatial regression models for Gaussian and non-Gaussian responses. The approach taken in spBayes is through generalized hierarchical random effects models estimated via Markov chain Monte Carlo (MCMC) sampling. Spatial effects are captured via a zero-centered multivariate Gaussian process for which a variety of spatial covariance structures can be specified. A new R package http://www.r-project.org called spBayes addresses this area with more success than previous attempts. One advantage of the MCMC approach is the ability to estimate functionals. In particular, a recent entropy-based measure called DIC, Deviance Information Criterion, is available to help consider the viability of competing nested or non-nested models conditional on the same set of data.
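
For readers unfamiliar with DIC, the usual recipe is: compute the deviance (minus twice the log-likelihood) at each MCMC draw and average it, compute the deviance at the posterior mean of the parameters, take the difference as the effective number of parameters pD, and add pD to the average deviance. The Python sketch below does this for a deliberately simple, non-spatial model (normal data with known standard deviation and a single unknown mean). It is not spBayes code; the model, the simulated data and the function names deviance and dic are assumptions made for illustration.

```python
import math
import random
from statistics import fmean

def deviance(y, mu, sigma):
    """Minus twice the log-likelihood of y under Normal(mu, sigma^2)."""
    ss = sum((yi - mu) ** 2 for yi in y)
    return len(y) * math.log(2 * math.pi * sigma ** 2) + ss / sigma ** 2

def dic(y, mu_draws, sigma):
    """Return (DIC, pD) given posterior draws of the mean."""
    d_bar = fmean(deviance(y, m, sigma) for m in mu_draws)  # mean deviance
    d_hat = deviance(y, fmean(mu_draws), sigma)   # deviance at posterior mean
    p_d = d_bar - d_hat                           # effective no. of parameters
    return d_bar + p_d, p_d

# Simulated data with known sd = 2. Under a flat prior the posterior of
# the mean is Normal(ybar, sigma^2 / n), so it is sampled directly here
# in place of real MCMC output.
rng = random.Random(42)
sigma = 2.0
y = [rng.gauss(10.0, sigma) for _ in range(100)]
post_sd = sigma / len(y) ** 0.5
mu_draws = [rng.gauss(fmean(y), post_sd) for _ in range(5000)]

dic_value, p_d = dic(y, mu_draws, sigma)
print(dic_value, p_d)   # p_d should be close to 1: one free parameter
```

When comparing models fit to the same data, smaller DIC is preferred; pD shows how much of the value is a complexity penalty.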

November 6, 2006

Glimmix - not your ordinary regression

SAS is known as a large and powerful statistical program. SAS now offers access to an add-on PROC called Glimmix. Many years ago this toolset was developed as a user macro, and it evolved to the point that SAS has turned it into a SAS/STAT PROC. However, it is not yet included in the STAT product under version 9.1. You must register and download it into your installed version of SAS. Also note that the 256-page documentation needs to be downloaded as well. From the SAS docs: "The GLIMMIX procedure fits statistical models to data with correlations or nonconstant variability and where the response is not necessarily normally distributed. These models are known as generalized linear mixed models (GLMM). The GLMMs, like linear mixed models, assume normal (Gaussian) random effects. Conditional on these random effects, data can have any distribution in the exponential family. The exponential family comprises many of the elementary discrete and continuous distributions. The binary, binomial, Poisson, and negative binomial distributions, for example, are discrete members of this family. The normal, beta, gamma, and chi-square distributions are representatives of the continuous distributions in this family. In the absence of random effects, the GLIMMIX procedure fits generalized linear models (fit by the GENMOD procedure)." Practically speaking, GLIMMIX is a cross between Proc GENMOD and Proc MIXED functionality. This is a huge plus for researchers who need to deal explicitly with the nature of their data rather than settle for a modeling approximation that is not quite right for the problem at hand. Examples include choosing a response distribution more closely aligned with your setting, and exploring covariance structures for correlated data and nesting. In addition, thin plate spline modeling is available to address NonParametric Smoothing of covariate effects when they may in fact be nonlinear.
Depending on your needs, you may view these additional capabilities as a blessing or a curse.

Specialized Statistics Resources

Often researchers will need access to functionality that isn't found in commercial statistics packages. This problem varies quite a bit and is met with specialized solutions by the statistical community. These solutions are often cutting edge, reflecting new statistical research. Most stats packages allow some form of macro authorship. This works to a point and often provides a just-in-time solution. Well known examples include Matlab's scripting language, SAS IML, GAUSS, Stata, Splus and R. Yet others will seek stand-alone solutions in one form or another. These range from public domain C, C++, Fortran, and Java research subroutines to stand-alone programs with various user interfaces. The goal of this blog is to list references and short descriptions of various solutions that may offer additional insights into your research and the statistical methods, and maybe even save you some time. About a dozen or so topics come to mind and I hope to address them shortly. These posts are not intended as statistical guidance or endorsement. Most problems are best addressed with the advice of an experienced practitioner in the relevant field.