Row 93691 - arrmlet/reddit_dataset

Data:

{
  "text": "Data leakage in kNN is not really a concern, I think: there is no separate training step at all! The model IS the training set and the algorithm is nonparametric. \n\nThe only thing I can imagine going wrong is OP letting each example count itself as one of its own k neighbours, in which case k=1 would return 100% accuracy, since there's no distance between a point and itself. There could also be duplicate rows that OP forgot to merge, maybe.\n\nOP: let us know how many examples you have in your dataset, the class balance, whether you normalized the data, whether you filled in missing values, whether you converted categorical features to numeric, what distance metric you used, how you handled noise, etc. \n\nMy opinion is: k=1 might return the best result, but the point of ML is to generalize, to make sense of data not only from the past but also in the future, and k=1 might be too narrow a value to base a decision on: what happens if an elderly patient comes in? Does he get diagnosed with Alzheimer's even though he doesn't have it, just because he is as old as the closest diseased patient? Or does he not get diagnosed, just because he is in the early stages and the closest true negative had similar values?",
  "label": "r/machinelearning",
  "dataType": "comment",
  "communityName": "r/MachineLearning",
  "datetime": "2024-05-25",
  "username_encoded": "Z0FBQUFBQm5Lak10M1N5Zjk1SHoyUWFUOS1XZVJ0ejJVb2l4V29pWTdGN1RJUXRXZ2dnMjZ4VXhnc1RVcWJQU3V3a0J5OUZZd0xfYnJtZDk0Wl9tRlpiVGlQX3Q3VmFlZ2o1VDN3TjlqYlViQy1OQmx3QkhRM009",
  "url_encoded": "Z0FBQUFBQm5Lak9fUTBMQlhWM1JZUXVTUE5FU005bHFZNTFmM0ZQLUU2NV9ZSThxQnJ6MW91aGo5eG1xZnVRRl9QQU4zRVBfVDdHby1zYy1KczJORUZLNVVjS19KTlRDNFVFUFFVaGNsSVE4MzRXRlhHdTk3Qzl3X2FxbGxIY2tucndVT2dueHFqSWU2MEtFMk5pc2NUdld3T0RpTGQ1YnBEQTBYR05BUjF3anFXY0g3djZDTkRSdGFXNnlrdTIwNFhWZ3lRd0VGSHF1ajduYldPNkZ4YzhZLThxMlBpV0lOdnVtUGE2LWRBbUUxTFU3T01KMFBNND0="
}
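
The self-neighbour pitfall described in the comment can be illustrated with a minimal sketch. This is not OP's code: the toy data, the function names (`knn_predict`, `accuracy`) and the `skip_index` parameter are all hypothetical, and plain Euclidean distance is assumed. It compares 1-NN accuracy when each point is allowed to match itself against a leave-one-out evaluation where it is not.

```python
import math
from collections import Counter

def knn_predict(train, query, k, skip_index=None):
    """Majority vote among the k nearest training points (Euclidean distance).

    skip_index (hypothetical helper parameter) drops one training row,
    so a point cannot be returned as its own neighbour.
    """
    dists = [
        (math.dist(x, query), label)
        for i, (x, label) in enumerate(train)
        if i != skip_index
    ]
    dists.sort(key=lambda d: d[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(train, k, exclude_self):
    """Score every stored point against the stored set itself."""
    hits = 0
    for i, (x, label) in enumerate(train):
        skip = i if exclude_self else None
        hits += knn_predict(train, x, k, skip) == label
    return hits / len(train)

# Made-up toy data: two features per example, binary label.
data = [
    ((1.0, 1.0), "neg"),
    ((1.2, 0.9), "neg"),
    ((3.0, 3.1), "pos"),
    ((3.2, 2.9), "pos"),
    ((2.0, 2.0), "neg"),
    ((2.1, 2.1), "pos"),
]

self_acc = accuracy(data, k=1, exclude_self=False)  # each point matches itself
loo_acc = accuracy(data, k=1, exclude_self=True)    # leave-one-out

print(f"k=1, self included: {self_acc:.2f}")  # 1.00
print(f"k=1, leave-one-out: {loo_acc:.2f}")   # 0.67
```

On this toy set the two borderline points near (2, 2) misclassify each other once self-matches are excluded, which is the gap a perfect k=1 score can hide.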