Pass percentiles to pandas agg function

Perhaps not super efficient, but one way would be to create a function yourself:

```python
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_
```

Then include this in your agg:

```python
In [11]: column.agg([np.sum, np.mean, np.std, np.median, np.var,
                     np.min, np.max, percentile(50), percentile(95)])
Out[11]:
sum  mean  std  median  var  amin  amax  percentile_50  percentile_95
```

… Read more
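The helper above can be exercised end to end. This is a minimal sketch on a toy DataFrame; the `df`, `group`, and `value` names are illustrative, not from the original answer:

```python
import numpy as np
import pandas as pd

def percentile(n):
    """Return an aggregator computing the n-th percentile, with a readable name."""
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

# hypothetical sample data
df = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b'],
                   'value': [1, 2, 3, 10, 20]})

# mix a built-in aggregator name with the custom percentile function
result = df.groupby('group')['value'].agg(['mean', percentile(50)])
print(result)
```

The `__name__` assignment is what makes the output column read `percentile_50` instead of the generic `percentile_`.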

Collapse / concatenate / aggregate a column to a single comma separated string within each group

Here are some options using toString, a function that concatenates a vector of strings using comma and space to separate components. If you don't want commas, you can use paste() with the collapse argument instead.

data.table

```r
# alternative using data.table
library(data.table)
as.data.table(data)[, toString(C), by = list(A, B)]
```

aggregate

This uses no packages:

```r
# alternative using
```

… Read more
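For readers working in Python rather than R, the same collapse can be sketched with pandas. This analogue is not part of the original answer; the column names `A`, `B`, `C` mirror the R example and the data is hypothetical:

```python
import pandas as pd

# hypothetical data with grouping columns A, B and a value column C
data = pd.DataFrame({'A': [1, 1, 2, 2],
                     'B': ['x', 'x', 'y', 'y'],
                     'C': ['p', 'q', 'r', 's']})

# collapse C to one comma-separated string per (A, B) group,
# analogous to toString(C) in the data.table version
out = data.groupby(['A', 'B'])['C'].agg(', '.join).reset_index()
print(out)
```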

ListAGG in SQLSERVER

MySQL

```sql
SELECT FieldA
     , GROUP_CONCAT(FieldB ORDER BY FieldB SEPARATOR ',') AS FieldBs
FROM TableName
GROUP BY FieldA
ORDER BY FieldA;
```

Oracle & DB2

```sql
SELECT FieldA
     , LISTAGG(FieldB, ',') WITHIN GROUP (ORDER BY FieldB) AS FieldBs
FROM TableName
GROUP BY FieldA
ORDER BY FieldA;
```

PostgreSQL

```sql
SELECT FieldA
     , STRING_AGG(FieldB, ',' ORDER BY FieldB) AS FieldBs
FROM
```

… Read more
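The same grouped-concatenation pattern can be tried locally without a database server. This sketch uses SQLite's GROUP_CONCAT (an analogue not covered in the excerpt above) through Python's built-in sqlite3 module; the table and column names reuse those from the answer:

```python
import sqlite3

# in-memory SQLite database with the answer's TableName schema
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE TableName (FieldA TEXT, FieldB TEXT)")
conn.executemany("INSERT INTO TableName VALUES (?, ?)",
                 [('g1', 'b'), ('g1', 'a'), ('g2', 'c')])

# SQLite's GROUP_CONCAT plays the role of LISTAGG / STRING_AGG
rows = conn.execute(
    "SELECT FieldA, GROUP_CONCAT(FieldB, ',') AS FieldBs "
    "FROM TableName GROUP BY FieldA ORDER BY FieldA").fetchall()
print(rows)
```

Note that unlike LISTAGG and STRING_AGG, SQLite's GROUP_CONCAT does not guarantee the order of the concatenated elements.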

Extract row corresponding to minimum value of a variable by group

Slightly more elegant:

```r
library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

   State Company Employees
1:    AK       D        24
2:    RI       E        19
```

Slightly less elegant than using .SD, but a bit faster (for data with many groups):

```r
DT[DT[ , .I[which.min(Employees)], by = State]$V1]
```

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data … Read more
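For comparison, the same "row with the group minimum" extraction can be sketched in pandas using idxmin. This analogue is my addition, not part of the original R answer; the data is hypothetical but mirrors the State/Company/Employees columns shown above:

```python
import pandas as pd

# hypothetical data mirroring the R example's columns
DT = pd.DataFrame({'State': ['AK', 'AK', 'RI', 'RI'],
                   'Company': ['C', 'D', 'E', 'F'],
                   'Employees': [30, 24, 19, 25]})

# idxmin gives the row label of the minimum Employees within each State,
# analogous to DT[, .SD[which.min(Employees)], by = State]
out = DT.loc[DT.groupby('State')['Employees'].idxmin()]
print(out)
```

Like which.min, idxmin returns a single row per group even when the minimum is tied.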

Pandas sum by groupby, but exclude certain columns

You can select the columns of a groupby:

```python
In [11]: df.groupby(['Country', 'Item_Code'])[['Y1961', 'Y1962', 'Y1963']].sum()
Out[11]:
                       Y1961  Y1962  Y1963
Country     Item_Code
Afghanistan 15            10     20     30
            25            10     20     30
Angola      15            30     40     50
            25            30     40     50
```

Note that the list passed must be a subset of the columns, otherwise you'll see a KeyError.
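A self-contained sketch of the same technique; the frame below is hypothetical, with an extra `Flag` column standing in for whatever you want excluded from the sum:

```python
import pandas as pd

# hypothetical frame: two year columns to sum, one column to exclude
df = pd.DataFrame({'Country': ['Afghanistan', 'Afghanistan', 'Angola'],
                   'Item_Code': [15, 15, 15],
                   'Y1961': [10, 5, 30],
                   'Y1962': [20, 10, 40],
                   'Flag': [1, 2, 3]})

# selecting columns on the groupby keeps 'Flag' out of the result entirely
out = df.groupby(['Country', 'Item_Code'])[['Y1961', 'Y1962']].sum()
print(out)
```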

Pandas aggregate count distinct

How about either of:

```python
>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002

>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date
2013-04-01        65        2
2013-04-02        45        1

>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date
2013-04-01        65
```

… Read more
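A runnable version of the same idea; note this sketch uses the string aggregator names `'sum'` and `'nunique'` (equivalent to the `np.sum` / `pd.Series.nunique` callables above, and friendlier to newer pandas versions):

```python
import pandas as pd

# data matching the excerpt above
df = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01',
                            '2013-04-02', '2013-04-02'],
                   'duration': [30, 15, 20, 15, 30],
                   'user_id': ['0001', '0001', '0002', '0002', '0002']})

# total duration per date, plus the count of distinct users
out = df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
print(out)
```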

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)