I got really sick on Sunday evening and could not post the entry on time …
Either way, here it is!
“Assessing the Performance of Human-Capable LLMs – Are LLMs Coming for Your Job?,” n.d. https://arxiv.org/abs/2410.16285
Today I read a #study assessing the performance of LLMs and humans at customer-service-like jobs.
It was easy to read, quite interesting, and made valid points. I like how they calculated a helpfulness metric based on the number of turns it took to reach an answer and then derived a final score from 0 to 100.
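The paper's exact formula isn't reproduced here, so purely as an illustration, this is roughly what a turns-based helpfulness score scaled to 0–100 could look like. The linear decay, the `max_turns` cap, and the `resolved` flag are my own assumptions, not taken from the paper.

```python
# Hypothetical sketch of a turns-based helpfulness score scaled to 0-100.
# The penalty scheme and parameters are my assumptions, not the paper's formula.

def helpfulness_score(turns_to_answer: int, resolved: bool, max_turns: int = 10) -> float:
    """Score a conversation: fewer turns to a resolved answer -> higher score."""
    if not resolved:
        return 0.0
    # Clamp so that very long conversations don't produce negative scores.
    turns = min(max(turns_to_answer, 1), max_turns)
    # Linear decay: 1 turn -> 100, max_turns turns -> close to 0.
    return 100.0 * (max_turns - turns + 1) / max_turns

print(helpfulness_score(turns_to_answer=1, resolved=True))    # 100.0
print(helpfulness_score(turns_to_answer=5, resolved=True))    # 60.0
print(helpfulness_score(turns_to_answer=12, resolved=False))  # 0.0
```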
The whole study is based on questions that can be summarized as the kind of problems usually found on #StackOverflow.
The conclusion is that current #LLMs are already capable of solving forum-like questions better than humans. That's pretty compelling and suggests that help desk jobs will probably become more and more obsolete in the future.
What I found remarkable is that the average scores achieved without #RAG are close enough to those of human responses to conclude there is not much difference.
But enough good things about this research. Now some bad things.
First of all, there are no examples of the human-to-human or human-to-LLM conversations. It’s super important to include such things in a research paper. Without them, you can hide so many details about the entire evaluation process.
Second of all, what were the questions? There are no examples of them. Who came up with them? I could surely come up with a set of questions that LLMs would solve perfectly and humans would not, or the other way around. This is crucial for the research to include.
I mention this because the conclusion is based on metrics that researchers can easily manipulate to prove their thesis. If you do research in a field as blurry as #LLMs, you need to be explicit about the datasets you use for evaluation. At least that’s my opinion.
“Time: Yes, It’s a Dimension, but No, It’s Not like Space,” n.d. https://bigthink.com/starts-with-a-bang/time-yes-dimension-not-like-space/
“How Do You Deploy in 10 Seconds?,” n.d. https://paravoce.bearblog.dev/how-do-you-deploy-in-10-seconds/
A simple #script that lets you deploy an application to a remote #host. It uses #rsync to copy files and Go to build the application, but you can quickly adapt it for whatever purpose you need.
The author claims it’s production-ready, but I would say it is only if your production application does not need to be #reliable and #scalable.
Either way, I would still love to use it for my small web services running on various VPSs.
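This is not the author’s actual script; as a rough sketch of the same build-then-rsync idea, here is a minimal Python version. The host, paths, binary name, and the systemd restart step are placeholders I made up for illustration.

```python
#!/usr/bin/env python3
# Minimal sketch of a "build locally, rsync to the server" deploy.
# Host, directory, binary name, and the restart step are placeholders.

import subprocess

HOST = "user@example.com"    # placeholder remote host
REMOTE_DIR = "/srv/myapp/"   # placeholder deploy directory
BINARY = "myapp"             # placeholder binary / service name

def run(cmd: list[str]) -> None:
    """Run a command and fail loudly if it exits non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def deploy() -> None:
    # Cross-compile a Linux binary locally (CGO disabled for a static-ish build).
    run(["env", "CGO_ENABLED=0", "GOOS=linux", "GOARCH=amd64",
         "go", "build", "-o", BINARY, "."])
    # Copy only what changed; rsync keeps repeated deploys fast.
    run(["rsync", "-avz", BINARY, f"{HOST}:{REMOTE_DIR}"])
    # Restart the service so the new binary is picked up (assumes systemd).
    run(["ssh", HOST, f"systemctl restart {BINARY}"])

if __name__ == "__main__":
    deploy()
```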