
Tal Korem

@tkorem.bsky.social

231 followers 157 following 47 posts

Microbiome, network inference, metabolism and reproductive health. All views are mine.


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Importantly - we'd love to hear your comments, feedback, and GitHub issues! In particular if there's additional prior work on this topic that we should note.

0 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

But CV is used not just for evaluation but also for hyperparameter tuning, and distributional bias impacts hyperparameters that affect regression to the mean. For example, we show that it biases toward weaker model regularization, which might affect generalization and downstream deployment.

1 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

With RebalancedCV we could see the "real-life" impact of distributional bias. We reproduced 3 recently published analyses that used LOOCV, and showed that it under-evaluated performance in all of them. While the effect isn't major, it is consistent.

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

With this in mind, we developed RebalancedCV, an sklearn-compatible package which drops the minimal amount of samples from the training set to maintain the same class balance in the training sets of all folds, thus resolving distributional bias. github.com/korem-lab/Re...

1 replies 0 reposts 1 likes
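
The rebalancing idea in the post above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the RebalancedCV package's code or API: the function name `rebalanced_loo_splits`, the `seed` parameter, and the toy labels are all assumptions made for the example. It assumes binary labels with both classes present, and drops one randomly chosen training sample of the class opposite to the held-out sample, so every fold trains on the same class balance.

```python
import random

def rebalanced_loo_splits(labels, seed=0):
    """Yield (train_indices, test_index), dropping one opposite-class sample per fold.

    Hypothetical helper for illustration; assumes binary labels, both classes present.
    """
    rng = random.Random(seed)
    indices = range(len(labels))
    for test in indices:
        # Candidates to drop: training samples whose label differs from the held-out one.
        opposite = [i for i in indices if labels[i] != labels[test]]
        dropped = rng.choice(opposite)
        train = [i for i in indices if i not in (test, dropped)]
        yield train, test

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # made-up data: 3 positives, 7 negatives

# Collect the mean training label of every fold; with rebalancing it never changes.
train_means = set()
for train, test in rebalanced_loo_splits(labels):
    train_means.add(sum(labels[i] for i in train) / len(train))

print(train_means)  # {0.25}: every fold trains on 2 positives and 6 negatives
```

Because the training-set mean is identical in every fold, it can no longer correlate with the held-out label, which is the property the post describes.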


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

As the issue is caused by a shift in the class balance of the training set, distributional bias can be addressed with stratified CV - but only if your dataset allows it to happen precisely. The less exact the stratification - the more bias you have (in this plot, closer to 0).

1 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Does this mean that past work with LOOCV is overinflated? Not quite. Most machine learning algorithms regress to the mean - not to its negative - and so they are actually _under_evaluated. That's the negative bias we started with!

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Distributional bias is a severe form of information leakage - so severe that we designed a dummy model that can achieve perfect auROC/auPR in ANY binary classification task evaluated via LOOCV (even without features). How? It just outputs the negative mean of the training set labels!

1 replies 0 reposts 0 likes
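
A minimal sketch of the dummy model from the post above, in plain Python. The helper names (`auroc`, `dummy_loocv_scores`) and the toy labels are assumptions for illustration, not code from the preprint. The model ignores features entirely: for each held-out sample it predicts the negative mean of the remaining labels, and the predictions are pooled across folds before computing auROC.

```python
def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def dummy_loocv_scores(labels):
    """Score each held-out sample with the negative mean of the other labels."""
    scores = []
    for i in range(len(labels)):
        train = labels[:i] + labels[i + 1:]  # leave sample i out
        scores.append(-sum(train) / len(train))
    return scores

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # made-up data: 3 positives, 7 negatives
scores = dummy_loocv_scores(labels)

print(auroc(labels, scores))  # 1.0
```

Held-out positives get a score of -2/9 and held-out negatives get -3/9, so every positive outranks every negative. Flipping the sign (predicting the plain training mean) gives an auROC of 0 on the same data.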


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

The issue is that every time one holds out a sample as a test set in LOOCV, the mean label of the training set shifts slightly, creating a perfect negative correlation across the folds between that mean and the test labels. We call this phenomenon distributional bias:

1 replies 0 reposts 1 likes
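
The shift described in the post above is easy to check numerically. The following sketch uses made-up labels and hypothetical helper names (`loocv_train_means`, `pearson`): with n samples whose labels sum to S, holding out a sample with label y leaves a training mean of (S - y) / (n - 1), which decreases linearly in y.

```python
from statistics import mean

def loocv_train_means(labels):
    """For each held-out sample, return the mean label of the remaining training set."""
    total, n = sum(labels), len(labels)
    return [(total - y) / (n - 1) for y in labels]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # made-up data: 3 positives, 7 negatives
train_means = loocv_train_means(labels)

# Holding out a positive leaves 2/9 positives; holding out a negative leaves 3/9.
for y, m in zip(labels, train_means):
    print(f"held-out label = {y}, training mean = {m:.3f}")

# The correlation across folds is -1, up to floating-point rounding.
print(pearson(train_means, labels))
```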


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

This story begins with benchmarking we did for some of our machine learning pipelines. We used random data, so we expected to see random classification accuracy (auROC=0.5). Instead, we found a clear negative bias that got worse with more imbalanced datasets:

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

A bit of background: when training models on small datasets it's common to use LOOCV, as it maximizes the number of samples for training. It also leaves a single sample for testing, meaning that many performance metrics (e.g., area under ROC curve) require aggregation across folds/iterations.

1 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

New preprint!

In a few words: we don't think you should use leave-one-out cross-validation (LOOCV).

In a lot of words (+RebalancedCV, a LOOCV alternative): arxiv.org/abs/2406.01652

In a thread:

1 replies 3 reposts 4 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

So, ISO an alternative search engine to Google, I guess. (It's 9:43 am CDT)

0 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

That's a cool idea and there's probably public data to do it -- thanks. What we did in that direction was to take paired long/short reads from the same sample, assume that the long reads are a real genome from the sample, and calculate bias with respect to them. It's not gone but it's much reduced.

0 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Very apt sequence from the other place
( @baym.lol )

0 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

When we look at known strain mixtures - the bias goes away! (Fig 5)

1 replies 0 reposts 2 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

๐Ÿ–ฅ๏ธ ๐Ÿงฌ #microbiome

0 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Note: We don't prove there is no leakage in Poore et al., and we don't discuss database errors. We tackle the general question of whether that specific observation is sufficient to indicate a problematic analysis.

0 replies 0 reposts 2 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Bottom line: If phenotype-associated originally empty features invalidate downstream analyses, then CLR is also invalid. Gihawi et al. did not present sufficient evidence to claim there is information leakage or error in machine learning (they do show that interpretation is complicated).

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

This is the same as predicting preterm birth with "Blautia (CLR)" or "empty feature (CLR)" - creating a legitimate microbiome predictor - just not one that's easy to interpret.

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Wait, but didn't Gihawi et al. run Poore et al.'s code on a matrix of zeros and get an accurate classifier? Many got this impression, but the text is clear on what was done - they took a subset of the processed Voom-SNM matrix.

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

We can actually see this in three of the four examples that Gihawi et al. highlight: a simple CLR transform (sample-wise - so no leakage) recreates the same observation of values associated with a tumor type. Here it is for a weird virus and adrenocortical carcinoma:

1 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Once more - this Blautia OTU is not really there, and it is definitely not related to preterm birth - but it is a real microbiome signature: it represents the (inverse) alpha diversity of the samples.

1 replies 1 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

But what's biologically "real" about the geometric mean? Well, for example, it's related to alpha diversity. To show this, we analyze a real vaginal microbiome dataset. We take the sparsest feature - probably not really there - and once again, after CLR, it's associated with preterm birth.

1 replies 0 reposts 2 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Is this feature biologically related to the phenotype? No. Is it really there? No. But is it leakage? No. This is a real microbiome signature! It represents the inverse of the sample geometric mean (and not whatever is in the column title of the feature).

1 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

First, we simulate a 50:50 case:control study in which case samples have a higher geometric mean. We then add an all-zero feature. After CLR? That feature has values and they are perfectly associated with the phenotype.

1 replies 0 reposts 1 likes
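
The simulation described in the post above can be sketched in plain Python. Everything specific here is an assumption for illustration rather than the authors' setup: the count ranges, the number of taxa and samples, the pseudocount of 1, and the helper name `clr`. CLR is computed per sample as log(count + pseudocount) minus the mean of those logs, so the all-zero feature ends up equal to the negative log of the sample's geometric mean.

```python
import math
import random

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample (uses only that sample's counts)."""
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)  # log of the sample's geometric mean
    return [v - center for v in logs]

rng = random.Random(0)
n_per_group, n_taxa = 20, 30  # 50:50 case:control design, made-up sizes

samples, is_case = [], []
for case in (False, True):
    # Assumed count ranges: cases get a higher geometric mean than controls.
    low, high = (100, 200) if case else (1, 20)
    for _ in range(n_per_group):
        counts = [rng.randint(low, high) for _ in range(n_taxa)]
        counts.append(0)  # the all-zero ("empty") feature
        samples.append(counts)
        is_case.append(case)

# CLR value of the empty feature in every sample.
empty_clr = [clr(s)[-1] for s in samples]
case_vals = [v for v, c in zip(empty_clr, is_case) if c]
control_vals = [v for v, c in zip(empty_clr, is_case) if not c]

# The empty feature separates the groups: every case value is below every control value.
print(max(case_vals) < min(control_vals))  # True
```

No labels and no other samples enter the transform, yet the empty feature now carries the phenotype signal, because it encodes each sample's geometric mean.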


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

We use counter-examples to show that it isn't. We use the CLR transform (with pseudocount), which is widely used in microbiome analysis. CLR uses only the taxa abundances within the samples - no phenotypes or other samples - so no information leakage or artificial tags.

1 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Gihawi et al. found that taxa with high importance in tumor type classifiers by Poore et al. (2020) were almost all zero in the raw data. They claim this indicates an artificial tag, information leakage, and artifactual models. Is this a necessary conclusion?
journals.asm.org/doi/10.1128/...

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Is finding taxa that were originally zero but have phenotype-associated values following batch correction a "major data analysis error" that invalidates downstream classifiers (in the 2020 cancer-microbiome paper or in general)?
We think it's not.
📰+🧵
www.biorxiv.org/content/10.1...

3 replies 3 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

🟦🟦🟦🟦 🟩🟪🟪🟩 🟩🟩🟩🟩 🟨🟨🟨🟨 🟪🟪🟪🟪 That was a lot of fun

1 replies 0 reposts 2 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Cancer research is also metabolism, bioinformatics, genetics, comp bio, cell bio, etc etc, and no one would dare say it's not a discipline

1 replies 0 reposts 1 likes


Reposted by Tal Korem

Hiutung Chu 's avatar Hiutung Chu @chulab.bsky.social
[ View ]

Check out our latest paper on how human milk oligosaccharides trigger the B. fragilis colonization program, led by Katya Buzun, in the current issue of Cell Host & Microbe! 🧪🧫🦠 #MicroSky www.cell.com/cell-host-mi...

1 replies 9 reposts 16 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Why does everyone hate it so much?

1 replies 0 reposts 1 likes


Reposted by Tal Korem

Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Interesting methodology + highly useful. What more can you ask for?

0 replies 1 reposts 4 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Come be my colleague!

A broad open-rank search in areas of quantitative / computational biology - Program for Mathematical Genomics, Department of Systems Biology, Columbia University.

Feel free to reach out with questions and please share.
jobs.sciencecareers.org/job/653640/f...

0 replies 20 reposts 9 likes


Reposted by Tal Korem

Jotham Suez (SuezLab@JHU)'s avatar Jotham Suez (SuezLab@JHU) @suez.bsky.social
[ View ]

🚨 Please share 🚨 The Suez Lab @JohnsHopkins is recruiting! We're seeking a senior bioinformatician with established (papers/preprints) experience in #microbiome #metagenomics. The position can be local, fully remote, or hybrid.

DM me for questions, apply here: jobs.jhu.edu/job/Baltimor...

0 replies 24 reposts 16 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

At many small airports this also means a line for taxis, etc. - it doesn't end there

0 replies 0 reposts 0 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

All takes on the Karikó Nobel should eventually come down to "tax the rich, fund the research" but somehow it stops at getting pissed at tenure committees and study sections

0 replies 6 reposts 16 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

WTAF. Last year the interim was 11 if I recall correctly

0 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

4 vaccine appts cancelled already. CVS pharmacist says no one has vaccines in stock and no orders forthcoming. Such a disgraceful public health failure.

0 replies 0 reposts 2 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Not sure I want an answer, but what happens if there's a shutdown over the 10/5 NIH deadline?

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

If only I'd taken control of my actions and priorities, I would've had time between 1-2 pm today, and all my days would have become satisfying

0 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

We probably need more data for this, but it would be cool to see if there are microbes that are "generally dormant", so that we could do this even in cross sectional studies

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Awesome work. Any thoughts on how to practically incorporate this in community-scale metabolic models?

1 replies 0 reposts 1 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

Beautiful work! Congrats Sean and team

1 replies 1 reposts 2 likes


Reposted by Tal Korem

Katie Mack's avatar Katie Mack @astrokatie.com
[ View ]

I feel like the definition of being at “high risk” from covid really needs to include “high probability of messing up your health and life for months or years” not just “likely to immediately kill you.” A 9% chance that I’m gonna be sick for at least three months is a pretty high risk!

55 replies 560 reposts 1461 likes


Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

The person constantly nodding in the audience

0 replies 0 reposts 3 likes


Reposted by Tal Korem

Tal Korem's avatar Tal Korem @tkorem.bsky.social
[ View ]

One of my favorite things about this place is that the internet has an end again. (will probably be one of the first things they take away though)

0 replies 0 reposts 0 likes