missing-data – Tarik Billa

Leaving values blank if not passed in str.format

April 10, 2024 by Tarik

You can follow the recommendation in PEP 3101 and subclass string.Formatter:

    import string

    class BlankFormatter(string.Formatter):
        def __init__(self, default=""):
            self.default = default

        def get_value(self, key, args, kwds):
            # named fields fall back to the default; positional fields use the stock lookup
            if isinstance(key, str):
                return kwds.get(key, self.default)
            else:
                return string.Formatter.get_value(self, key, args, kwds)

    kwargs = {"name": "mark", "adj": "mad"}
    fmt = BlankFormatter()
    print(fmt.format("My name is {name} and I'm really {adj}.", **kwargs))
    # … Read more
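The example in the excerpt passes both keys, so the blank-filling behaviour never shows. The following is a minimal sketch of the same subclass idea with a key left out; the `default="?"` variant and the positional call are illustrative additions, not part of the original answer.

```python
import string


class BlankFormatter(string.Formatter):
    """Formatter that substitutes a default for missing named fields."""

    def __init__(self, default=""):
        super().__init__()
        self.default = default

    def get_value(self, key, args, kwds):
        # Named fields: look up the keyword, fall back to the default.
        if isinstance(key, str):
            return kwds.get(key, self.default)
        # Positional fields: defer to the standard lookup.
        return super().get_value(key, args, kwds)


fmt = BlankFormatter()
# "adj" is not supplied, so it is rendered as an empty string.
partial = fmt.format("My name is {name} and I'm really {adj}.", name="mark")

# Illustrative variant: a visible placeholder instead of a blank.
custom = BlankFormatter(default="?").format("{0} is {adj}", "mark")

print(partial)
print(custom)
```

Positional fields still raise IndexError when the argument is missing, because only string keys are given a default here.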

Randomly insert NA values in a pandas DataFrame

April 3, 2024 by Tarik

Here’s a way to clear exactly 10% of cells (or rather, as close to 10% as can be achieved with the existing data frame’s size).

    import random
    import numpy as np

    # every (row, col) position in df
    ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
    for row, col in random.sample(ix, int(round(.1*len(ix)))):
        df.iat[row, col] = np.nan

Here’s a way to clear cells … Read more
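To see the sampling approach above run end to end, here is a small sketch on a made-up frame. The 10 x 10 float DataFrame and the seeded `random.Random(0)` are assumptions for illustration only; a float dtype is used so that NaN can be stored without changing column types.

```python
import random

import numpy as np
import pandas as pd

# Hypothetical 10 x 10 float frame (not from the original post).
df = pd.DataFrame(np.arange(100, dtype=float).reshape(10, 10))

rng = random.Random(0)  # seeded only to make the sketch repeatable

# Every (row, col) position, then a sample of 10% of them without replacement.
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
n_blank = int(round(0.1 * len(ix)))
for row, col in rng.sample(ix, n_blank):
    df.iat[row, col] = np.nan

n_missing = int(df.isna().sum().sum())
print(n_missing, "of", df.size, "cells are NaN")
```

Because the positions are sampled without replacement, the number of cleared cells is exact rather than approximately 10%.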

Replacing NAs in R with nearest value

January 6, 2024 by Tarik

python scikit-learn clustering with missing data

January 6, 2024 by Tarik

I think you can use an iterative EM-type algorithm:

1. Initialize missing values to their column means
2. Repeat until convergence:
   a. Perform K-means clustering on the filled-in data
   b. Set the missing values to the centroid coordinates of the clusters to which they were assigned

Implementation

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_missing(X, n_clusters, max_iter=10):
    … Read more
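The steps listed above can be sketched as follows. This is not the truncated `kmeans_missing` from the post: the function name `kmeans_fill_missing`, the `random_state` parameter, the label-equality stopping rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_fill_missing(X, n_clusters, max_iter=10, random_state=0):
    """Hypothetical helper: alternate K-means and centroid-based imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: initialise missing entries with their column means.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)

    prev_labels = None
    for _ in range(max_iter):
        # Step 2a: cluster the filled-in data.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        labels = km.fit_predict(X_filled)
        centroids = km.cluster_centers_

        # Step 2b: overwrite only the missing entries with the coordinates
        # of the centroid each row was assigned to.
        X_filled[missing] = centroids[labels][missing]

        # Assumed convergence test: stop once assignments stop changing.
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break
        prev_labels = labels

    return labels, centroids, X_filled


# Toy data: two separated blobs with roughly 10% of entries removed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(10, 1, (20, 3))])
X[rng.random(X.shape) < 0.1] = np.nan

labels, centroids, X_filled = kmeans_fill_missing(X, n_clusters=2)
print(labels.shape, centroids.shape, int(np.isnan(X_filled).sum()))
```

Observed values are never modified; only the positions that were NaN in the input are updated on each pass.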

Fill in missing pandas data with previous non-missing value, grouped by key

January 5, 2024 by Tarik

You could perform a groupby/forward-fill operation on each group:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
    df['x'] = df.groupby(['id'])['x'].ffill()
    print(df)

yields

       id      x
    0   1   10.0
    1   1   20.0
    2   2  100.0
    3   2  200.0
    4   1   20.0
    5   2  200.0
    6   1  300.0
    7   1  300.0

Error in na.fail.default: missing values in object – but no missing values

January 3, 2024 by Tarik

Pandas Dataframe: Replacing NaN with row average

December 21, 2023 by Tarik

As commented, the axis argument to fillna is NotImplemented:

    df.fillna(df.mean(axis=1), axis=1)

Note: this would be critical here, as you don’t want to fill in your nth column with the nth row average. For now you’ll need to iterate through:

    m = df.mean(axis=1)
    for i, col in enumerate(df):
        # using i allows for duplicate columns
        # … Read more
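One way to get the row-average fill without the axis argument is to fill each column from the Series of row means, which aligns on the row index. This is a sketch of that idea, not the truncated loop from the post; the three-column frame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in "a" and one in "b".
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, 4.0, np.nan],
    "c": [3.0, 6.0, 9.0],
})

# Row means skip NaN by default: 2.0, 5.0, 6.0.
m = df.mean(axis=1)

# Each column is a Series indexed by row label, so fillna(m) picks the
# mean of the row in which the gap sits.
filled = df.apply(lambda col: col.fillna(m))

print(filled)
```

The original `df` is left unchanged; `filled` is a new frame.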

Missing values in scikits machine learning

September 7, 2023 by Tarik

Missing values are simply not supported in scikit-learn. There has been discussion on the mailing list about this before, but no attempt to actually write code to handle them. Whatever you do, don’t use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs. The above answer is outdated; … Read more
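For reference, recent scikit-learn releases ship imputation transformers. A minimal sketch, assuming scikit-learn 0.20 or later (where `sklearn.impute.SimpleImputer` is available), fills NaN-encoded gaps with column means before the data reaches an estimator. The small array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with NaN marking the missing entries.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The fitted imputer can then be applied to new data with `imputer.transform`, so test data is filled with the training means.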

Multivariate LSTM with missing values

August 31, 2023 by Tarik

As suggested by François Chollet (creator of Keras) in his book, one way to handle missing values is to replace them with zero:

In general, with neural networks, it’s safe to input missing values as 0, with the condition that 0 isn’t already a meaningful value. The network will learn from exposure to the data … Read more
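The zero-replacement step can be done on the input array before it is passed to the network. A minimal NumPy sketch, assuming a made-up batch shaped (samples, timesteps, features) in which 0 is not already a meaningful value:

```python
import numpy as np

# Hypothetical batch: 2 samples, 2 timesteps, 2 features.
X = np.array([
    [[0.5, np.nan], [1.5, 2.0]],
    [[np.nan, 3.0], [2.5, np.nan]],
])

# Replace every NaN with 0.0 and leave observed values untouched.
X_zero = np.where(np.isnan(X), 0.0, X)

print(X_zero)
```

The shape is unchanged, so the result can be fed to a model that expects the original layout.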

Pandas: print column name with missing values

August 10, 2023 by Tarik

df.isnull().any() generates a boolean Series (True if the column has a missing value, False otherwise). You can use it to index into df.columns:

    df.columns[df.isnull().any()]

will return an Index of the columns which have missing values.

    df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, np.nan], 'C': [4, 5, 6], 'D': [np.nan, np.nan, np.nan]})
    df

    Out: … Read more
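Put together as a runnable sketch, using the same four-column frame as the excerpt (the variable names `has_missing` and `cols_with_missing` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [1, 2, np.nan],
    "C": [4, 5, 6],
    "D": [np.nan, np.nan, np.nan],
})

# One boolean per column: True where at least one value is missing.
has_missing = df.isnull().any()

# Boolean-index the column labels, then convert to a plain list for printing.
cols_with_missing = df.columns[has_missing].tolist()

print(cols_with_missing)  # ['B', 'D']
```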