data-processing – Tarik Billa

Lua vs Embedded Lisp and potential other candidates for set-based data processing

September 12, 2023 by Tarik

I strongly agree with @jpjacobs’s points. Lua is an excellent choice for embedding, unless there’s something very specific about Lisp that you need (for instance, if your data maps particularly well to cons cells). I’ve used Lisp for many, many years, BTW, and I quite like Lisp syntax, but these days I’d generally pick Lua. While …

Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?) [duplicate]

June 8, 2023 by Tarik

How to use pandas filter with IQR

May 10, 2023 by Tarik

As far as I know, the most compact notation is offered by the query method.

    import numpy as np
    import pandas as pd

    # Some test data
    np.random.seed(33454)
    df = (
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used here
        pd.concat([
            # Twenty regular values in [0, 100)
            pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
            # Adding some outliers
            pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
        ])
        # Resetting the index
        .reset_index(drop=True)
    )

    # Computing IQR
    Q1 = df['nb'].quantile(0.25)
    Q3 = …
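The excerpt above is cut off before the filter itself, so the following is only a minimal sketch of how an IQR filter might be written with `query`. It assumes the conventional 1.5 × IQR fences, which the truncated text does not confirm, and it uses hypothetical fixed sample data instead of the random data from the excerpt so the result is reproducible. The names `lower`, `upper`, and `filtered` are illustrative.

```python
import pandas as pd

# Hypothetical fixed data: 20 regular values (0, 5, ..., 95)
# plus two obvious outliers.
df = pd.DataFrame({"nb": list(range(0, 100, 5)) + [180, 190]})

# Quartiles and interquartile range
q1 = df["nb"].quantile(0.25)
q3 = df["nb"].quantile(0.75)
iqr = q3 - q1

# Assumed fences: the conventional 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# query() resolves names prefixed with @ from the surrounding Python scope
filtered = df.query("nb >= @lower and nb <= @upper")
```

With this sample data, the values 180 and 190 fall above the upper fence and are dropped, leaving the 20 regular rows in `filtered`.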

Best way to format large JSON file? (~30 mb)

March 25, 2023 by Tarik

With Python >= 2.6 you can do the following:

For Mac/Linux users:

    cat ugly.json | python -m json.tool > pretty.json

For Windows users (thanks to the comment from dnk.nitro):

    type ugly.json | python -m json.tool > pretty.json
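If a shell pipeline is not convenient, roughly the same result can be obtained from a script with the standard-library `json` module, which is what `json.tool` is built on. This is a minimal sketch: the function name `pretty_print_json` and the `indent=4` default are assumptions for illustration, and the whole file is loaded into memory, which should be acceptable for a file of about 30 MB.

```python
import json


def pretty_print_json(src_path, dst_path, indent=4):
    """Read a JSON file and write an indented copy of it.

    The function name and the indent default are illustrative choices,
    not part of any existing tool.
    """
    # Parse the whole document into memory
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)

    # Write it back with indentation, ending with a newline
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, indent=indent)
        dst.write("\n")
```

A call such as `pretty_print_json("ugly.json", "pretty.json")` would then mirror the commands above.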

Large-scale data processing: HBase vs Cassandra [closed]

January 29, 2023 by Tarik

As a Cassandra developer, I’m better at answering the other side of the question: Cassandra scales better. Cassandra is known to scale to over 400 nodes in a cluster; when Facebook deployed Messaging on top of HBase they had to shard it across 100-node HBase sub-clusters. Cassandra supports hundreds, even thousands of ColumnFamilies. “HBase currently …