Savitzky–Golay Filter: 60 Years 🎂 of Smoothing Noisy Data

Sixty years ago, in July of 1964, Abraham Savitzky and Marcel J. E. Golay published a seminal paper in Analytical Chemistry where they introduced a technique designed to smooth out noise in spectroscopic data and enhance the signal without significantly distorting the underlying trends. The Savitzky–Golay filter was one of the first signal processing techniques I learned in grad school, and the simplicity and elegance of this method have captured my heart ever since. As we mark the 60th anniversary of this influential technique, let’s explore its development, its applications, and its enduring impact on science.

Oh, and we are also going to implement it in Excel VBA.
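As a quick taste of what the filter does, here is a minimal sketch in Python using SciPy’s `savgol_filter` (not the VBA implementation from the post, and with made-up test data): the filter fits a low-order polynomial to a sliding window of points and evaluates it at the center, smoothing noise while tracking the underlying curve.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic noisy signal (illustrative data, not from the original post)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# Fit a cubic polynomial over a sliding 21-point window;
# this preserves peak shapes far better than a plain moving average
smoothed = savgol_filter(noisy, window_length=21, polyorder=3)
```

The residual noise in `smoothed` should be visibly smaller than in `noisy`, while the sine shape survives intact.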

Guidance for Industry: The Reference Section

Anyone working in the pharmaceutical industry must have a good comprehension of the applicable regulations. While attending courses and workshops is very helpful in staying up to date, there is really no substitute for reading the source documents. Here I keep my curated “Reference Section” of regulatory documents and guidance that I find myself referring back to all the time. It may be useful to anyone working in clinical sample testing and/or IVD development.

Setting up to run Mistral LLM locally

Mistral is a series of Large Language Models (LLMs) developed by the French company Mistral AI. These models are notable for their high performance and efficiency in generating text. Many of Mistral’s models are open source and can be used without restrictions under the Apache 2.0 license. Here is a quick troubleshooting walkthrough on how I set up dolphin-2.5-mixtral-8x7b to run locally with the cuBLAS back-end for hardware acceleration. I experienced a few annoying hang-ups in the process, so hopefully this little note can help anyone experiencing the same.

Design of Experiments (DOE): The Overview

Design of Experiments (DOE) is a systematic approach to planning, conducting, and analyzing efficient scientific experiments. It is an indispensable tool in the optimization of complex processes, especially in engineering and manufacturing. Unlike traditional one-variable-at-a-time methods, DOE involves simultaneously varying multiple factors to efficiently assess their individual and interactive effects on the outcome. The most obvious benefit of DOE is that it allows us to dramatically reduce the number of experiments needed to characterize a system (well, depending on the design you choose). But more importantly, it allows us to uncover interactions between variables, something that one-variable-at-a-time testing simply cannot do.

This note gives a top-level overview of DOE as a technique; detailed looks at the various methods will come in subsequent posts.
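To make the “simultaneously varying multiple factors” idea concrete, here is a minimal sketch of a two-level full-factorial design in Python. The factor names and levels are hypothetical examples, not from the post:

```python
from itertools import product

# Hypothetical factors, each at two levels (a 2^3 full-factorial design)
factors = {
    "temperature": [60, 80],    # degrees C
    "pH": [6.5, 7.5],
    "mixing_speed": [200, 400], # rpm
}

# Enumerate every combination of levels: 2 * 2 * 2 = 8 runs.
# Because all factors vary together across the design, main effects
# AND interactions can be estimated, which one-variable-at-a-time cannot do.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
```

Each entry of `runs` is one experimental condition, e.g. `{"temperature": 60, "pH": 6.5, "mixing_speed": 200}`.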

SDTM vs. CDASH: Why do we need two standards?

In the realm of clinical research and data management, adherence to standardized formats and structures is paramount. Two such standards, CDASH (Clinical Data Acquisition Standards Harmonization) and SDTM (Study Data Tabulation Model), play crucial roles in ensuring consistency and interoperability across clinical trial data. There is a lot of overlap between CDASH and SDTM, which often creates a false impression that they serve the same purpose, or even compete with each other. In reality, the two standards are designed to complement each other, while addressing similar but distinct needs. In this note, we delve into the key differences between these two standards to better understand their respective roles in the clinical research landscape.

Hugo: The Basics

Hugo is an open-source static site generator (SSG) designed for speed and efficiency. It offers sub-second rendering times, freedom to organize your content to your heart’s desire, and a good toolset for setting up advanced file processing pipelines. Having dealt with (read: suffered through) many SSGs before, I was pleasantly surprised by the simplicity × power factor that Hugo offers. Here I document my experimentation with this tool.

Power Query: My Custom Functions

Let’s face it, Microsoft Excel is the most impactful data analysis tool in existence. And since 2013, its impact has grown dramatically with the introduction of Power Query, a data refinement and transformation tool that allows users to create analysis and reporting pipelines from a variety of sources. And while Power Query’s built-in functionality is quite… err… powerful, advanced users will quickly find the need to develop custom solutions. In this note, I keep a constantly growing collection of Power Query custom functions that I use across most of my projects.

Artificial Neural Networks: Embedding Models

Before we can process language-based data using Artificial Neural Networks (ANNs), we need to convert this data into some kind of numerical representation. Embedding models are designed for this purpose. They transform language data into dense, high-dimensional vectors that preserve the semantic associations between words. These vectors capture the essence of language data in a way that computers can understand and process. This note explores the most popular embedding model architectures, looks into how these models are trained, and discusses their critical role in Natural Language Processing (NLP).
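The core idea can be sketched in a few lines of Python. The tiny 4-dimensional vectors below are made-up toy values standing in for a real trained model (real embeddings from models such as word2vec or GloVe have hundreds of dimensions), but they illustrate how semantic similarity becomes a geometric quantity:

```python
import numpy as np

# Toy "embeddings" (hypothetical values for illustration only)
vocab = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.7, 0.2, 0.8]),
    "apple": np.array([0.1, 0.2, 0.9, 0.3]),
}

def cosine_similarity(a, b):
    """Angle-based similarity: near 1.0 for aligned vectors, near 0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words end up closer in the vector space
king_queen = cosine_similarity(vocab["king"], vocab["queen"])
king_apple = cosine_similarity(vocab["king"], vocab["apple"])
```

With these toy vectors, `king_queen` comes out well above `king_apple`, which is exactly the property a trained embedding model is optimized to produce at scale.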