Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
anthropic/claude-3.5-sonnet | 0.901 | 0.832 | 4015 |
openai/gpt-4o | 0.862 | 0.780 | 5927 |
openai/gpt-4 | 0.839 | 0.832 | 5949 |
microsoft/wizardlm-2-8x22b | 0.838 | 0.797 | 5910 |
google/gemini-flash-1.5 | 0.820 | 0.752 | 3852 |
anthropic/claude-3-haiku:beta | 0.816 | 0.715 | 3778 |
google/gemini-pro | 0.808 | 0.743 | 4038 |
meta-llama/llama-3-70b-instruct:nitro | 0.803 | 0.696 | 3999 |
openai/gpt-3.5-turbo-0125 | 0.790 | 0.690 | 5770 |
mistralai/mixtral-8x7b-instruct:nitro | 0.781 | 0.855 | 3732 |
meta-llama/llama-3-8b-instruct:nitro | 0.768 | 0.664 | 4065 |
google/palm-2-chat-bison-32k | 0.755 | 0.667 | 4844 |
google/gemini-pro-1.5 | 0.754 | 0.645 | 4822 |
openchat/openchat-7b | 0.704 | 0.775 | 3905 |
mistralai/mistral-7b-instruct | 0.694 | 0.652 | 3940 |
microsoft/phi-3-mini-128k-instruct | 0.693 | 0.781 | 5345 |
perplexity/llama-3-sonar-small-32k-chat | 0.645 | 0.555 | 3767 |
google/gemma-7b-it | 0.522 | 0.477 | 4160 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
anthropic/claude-3.5-sonnet | 0.819 | 0.652 | 3875 |
openai/gpt-4 | 0.812 | 0.646 | 3878 |
meta-llama/llama-3-70b-instruct:nitro | 0.811 | 0.644 | 3876 |
openai/gpt-4o | 0.811 | 0.653 | 3878 |
openai/gpt-3.5-turbo-0125 | 0.808 | 0.643 | 3878 |
google/gemini-pro-1.5 | 0.806 | 0.641 | 3871 |
microsoft/wizardlm-2-8x22b | 0.805 | 0.641 | 3867 |
anthropic/claude-3-haiku:beta | 0.804 | 0.640 | 3870 |
openchat/openchat-7b | 0.799 | 0.606 | 3875 |
google/gemini-flash-1.5 | 0.794 | 0.631 | 3877 |
google/gemini-pro | 0.789 | 0.625 | 3842 |
mistralai/mixtral-8x7b-instruct:nitro | 0.780 | 0.618 | 3878 |
google/palm-2-chat-bison-32k | 0.779 | 0.620 | 3876 |
perplexity/llama-3-sonar-small-32k-chat | 0.741 | 0.583 | 3875 |
meta-llama/llama-3-8b-instruct:nitro | 0.707 | 0.584 | 3793 |
mistralai/mistral-7b-instruct | 0.666 | 0.634 | 3843 |
microsoft/phi-3-mini-128k-instruct | 0.657 | 0.536 | 3869 |
google/gemma-7b-it | 0.595 | 0.530 | 3847 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
openai/gpt-4o | 0.897 | 0.736 | 4356 |
google/gemini-flash-1.5 | 0.841 | 0.702 | 1507 |
google/gemini-pro-1.5 | 0.822 | 0.643 | 5635 |
meta-llama/llama-3-70b-instruct:nitro | 0.782 | 0.564 | 2331 |
anthropic/claude-3.5-sonnet | 0.772 | 0.522 | 3820 |
openai/gpt-4 | 0.761 | 0.508 | 5551 |
google/gemini-pro | 0.716 | 0.392 | 1509 |
google/palm-2-chat-bison-32k | 0.709 | 0.406 | 2398 |
microsoft/wizardlm-2-8x22b | 0.699 | 0.537 | 5085 |
openai/gpt-3.5-turbo-0125 | 0.671 | 0.514 | 6486 |
openchat/openchat-7b | 0.595 | 0.563 | 43 |
mistralai/mixtral-8x7b-instruct:nitro | 0.534 | 0.551 | 2226 |
mistralai/mistral-7b-instruct | 0.502 | 0.461 | 1179 |
microsoft/phi-3-mini-128k-instruct | 0.496 | 0.426 | 707 |
meta-llama/llama-3-8b-instruct:nitro | 0.399 | 0.396 | 1517 |
anthropic/claude-3-haiku:beta | 0.397 | 0.267 | 7 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
anthropic/claude-3.5-sonnet | 0.723 | 0.240 | 4559 |
openai/gpt-4 | 0.719 | 0.442 | 6739 |
microsoft/wizardlm-2-8x22b | 0.687 | 0.228 | 6687 |
openai/gpt-4o | 0.664 | 0.174 | 6704 |
google/gemini-flash-1.5 | 0.639 | 0.152 | 4369 |
google/gemini-pro-1.5 | 0.636 | 0.160 | 5520 |
perplexity/llama-3-sonar-small-32k-chat | 0.633 | 0.181 | 4241 |
openai/gpt-3.5-turbo-0125 | 0.624 | 0.619 | 6454 |
meta-llama/llama-3-70b-instruct:nitro | 0.623 | 0.143 | 4542 |
google/palm-2-chat-bison-32k | 0.617 | 0.648 | 5506 |
mistralai/mixtral-8x7b-instruct:nitro | 0.612 | 0.148 | 4239 |
google/gemini-pro | 0.604 | 0.407 | 4574 |
microsoft/phi-3-mini-128k-instruct | 0.579 | 0.134 | 5993 |
mistralai/mistral-7b-instruct | 0.579 | 0.120 | 4443 |
meta-llama/llama-3-8b-instruct:nitro | 0.567 | 0.116 | 4594 |
openchat/openchat-7b | 0.562 | 0.237 | 4443 |
anthropic/claude-3-haiku:beta | 0.533 | 0.104 | 4287 |
google/gemma-7b-it | 0.500 | 0.100 | 4711 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
openai/gpt-4 | 0.619 | 0.774 | 507 |
mistralai/mistral-7b-instruct | 0.602 | 0.757 | 332 |
openchat/openchat-7b | 0.600 | 0.778 | 338 |
google/gemini-pro-1.5 | 0.572 | 0.734 | 426 |
openai/gpt-4o | 0.557 | 0.736 | 504 |
google/gemini-pro | 0.551 | 0.743 | 357 |
google/palm-2-chat-bison-32k | 0.529 | 0.721 | 411 |
anthropic/claude-3.5-sonnet | 0.504 | 0.722 | 340 |
mistralai/mixtral-8x7b-instruct:nitro | 0.500 | 0.716 | 328 |
perplexity/llama-3-sonar-small-32k-chat | 0.500 | 0.719 | 317 |
google/gemma-7b-it | 0.500 | 0.713 | 363 |
meta-llama/llama-3-8b-instruct:nitro | 0.500 | 0.692 | 347 |
anthropic/claude-3-haiku:beta | 0.498 | 0.719 | 321 |
google/gemini-flash-1.5 | 0.497 | 0.707 | 319 |
microsoft/wizardlm-2-8x22b | 0.497 | 0.708 | 505 |
openai/gpt-3.5-turbo-0125 | 0.497 | 0.712 | 477 |
microsoft/phi-3-mini-128k-instruct | 0.491 | 0.713 | 451 |
meta-llama/llama-3-70b-instruct:nitro | 0.490 | 0.708 | 344 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
google/gemini-flash-1.5 | 0.847 | 0.799 | 2670 |
anthropic/claude-3-haiku:beta | 0.843 | 0.803 | 2598 |
google/gemini-pro | 0.833 | 0.791 | 2758 |
meta-llama/llama-3-70b-instruct:nitro | 0.833 | 0.848 | 2754 |
openai/gpt-4o | 0.818 | 0.868 | 4097 |
google/gemini-pro-1.5 | 0.817 | 0.822 | 3374 |
microsoft/wizardlm-2-8x22b | 0.803 | 0.715 | 4093 |
anthropic/claude-3.5-sonnet | 0.792 | 0.882 | 2805 |
google/palm-2-chat-bison-32k | 0.790 | 0.831 | 3359 |
meta-llama/llama-3-8b-instruct:nitro | 0.779 | 0.824 | 2831 |
perplexity/llama-3-sonar-small-32k-chat | 0.758 | 0.696 | 2603 |
openai/gpt-4 | 0.752 | 0.892 | 4120 |
mistralai/mistral-7b-instruct | 0.742 | 0.860 | 2723 |
openai/gpt-3.5-turbo-0125 | 0.742 | 0.843 | 3934 |
openchat/openchat-7b | 0.740 | 0.861 | 2721 |
mistralai/mixtral-8x7b-instruct:nitro | 0.696 | 0.893 | 2604 |
google/gemma-7b-it | 0.685 | 0.580 | 2886 |
microsoft/phi-3-mini-128k-instruct | 0.593 | 0.851 | 3657 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
anthropic/claude-3.5-sonnet | 0.916 | 0.671 | 5039 |
openai/gpt-4o | 0.894 | 0.641 | 7438 |
google/gemini-pro-1.5 | 0.888 | 0.597 | 6104 |
google/gemini-pro | 0.884 | 0.576 | 5063 |
meta-llama/llama-3-70b-instruct:nitro | 0.865 | 0.673 | 5027 |
anthropic/claude-3-haiku:beta | 0.847 | 0.507 | 4743 |
openai/gpt-4 | 0.842 | 0.603 | 7473 |
google/gemini-flash-1.5 | 0.834 | 0.489 | 4863 |
google/palm-2-chat-bison-32k | 0.831 | 0.749 | 6121 |
microsoft/wizardlm-2-8x22b | 0.817 | 0.551 | 7418 |
openai/gpt-3.5-turbo-0125 | 0.768 | 0.568 | 7180 |
mistralai/mistral-7b-instruct | 0.759 | 0.442 | 4907 |
perplexity/llama-3-sonar-small-32k-chat | 0.731 | 0.460 | 4717 |
microsoft/phi-3-mini-128k-instruct | 0.725 | 0.388 | 6663 |
mistralai/mixtral-8x7b-instruct:nitro | 0.713 | 0.616 | 4698 |
meta-llama/llama-3-8b-instruct:nitro | 0.702 | 0.448 | 5100 |
google/gemma-7b-it | 0.636 | 0.397 | 5220 |
openchat/openchat-7b | 0.614 | 0.422 | 4926 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
mistralai/mistral-7b-instruct | 0.667 | 0.500 | 15 |
meta-llama/llama-3-8b-instruct:nitro | 0.636 | 0.467 | 18 |
google/gemma-7b-it | 0.636 | 0.333 | 15 |
openai/gpt-4 | 0.633 | 0.421 | 23 |
openai/gpt-4o | 0.567 | 0.381 | 23 |
google/gemini-pro | 0.550 | 0.471 | 18 |
google/palm-2-chat-bison-32k | 0.536 | 0.350 | 21 |
anthropic/claude-3-haiku:beta | 0.500 | 0.333 | 18 |
anthropic/claude-3.5-sonnet | 0.500 | 0.294 | 17 |
microsoft/wizardlm-2-8x22b | 0.500 | 0.348 | 23 |
meta-llama/llama-3-70b-instruct:nitro | 0.500 | 0.316 | 19 |
mistralai/mixtral-8x7b-instruct:nitro | 0.500 | 0.333 | 12 |
perplexity/llama-3-sonar-small-32k-chat | 0.500 | 0.357 | 14 |
google/gemini-flash-1.5 | 0.500 | 0.286 | 14 |
google/gemini-pro-1.5 | 0.500 | 0.300 | 20 |
openchat/openchat-7b | 0.482 | 0.300 | 16 |
openai/gpt-3.5-turbo-0125 | 0.464 | 0.273 | 20 |
microsoft/phi-3-mini-128k-instruct | 0.442 | 0.333 | 18 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
openai/gpt-4o | 0.809 | 0.812 | 5332 |
openai/gpt-4 | 0.787 | 0.840 | 5357 |
anthropic/claude-3-haiku:beta | 0.777 | 0.662 | 3385 |
google/gemini-pro-1.5 | 0.759 | 0.682 | 4381 |
google/palm-2-chat-bison-32k | 0.759 | 0.805 | 4393 |
google/gemini-pro | 0.734 | 0.613 | 3617 |
anthropic/claude-3.5-sonnet | 0.733 | 0.652 | 3621 |
openai/gpt-3.5-turbo-0125 | 0.693 | 0.853 | 5129 |
microsoft/wizardlm-2-8x22b | 0.687 | 0.545 | 5327 |
openchat/openchat-7b | 0.653 | 0.551 | 3513 |
mistralai/mistral-7b-instruct | 0.651 | 0.492 | 3545 |
meta-llama/llama-3-70b-instruct:nitro | 0.651 | 0.505 | 3607 |
google/gemini-flash-1.5 | 0.631 | 0.484 | 3487 |
meta-llama/llama-3-8b-instruct:nitro | 0.601 | 0.463 | 3657 |
mistralai/mixtral-8x7b-instruct:nitro | 0.589 | 0.441 | 3369 |
google/gemma-7b-it | 0.583 | 0.631 | 3773 |
microsoft/phi-3-mini-128k-instruct | 0.582 | 0.461 | 4763 |
perplexity/llama-3-sonar-small-32k-chat | 0.499 | 0.398 | 3387 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
google/gemini-pro-1.5 | 0.797 | 0.263 | 6136 |
google/gemini-flash-1.5 | 0.774 | 0.419 | 4884 |
meta-llama/llama-3-70b-instruct:nitro | 0.772 | 0.262 | 5050 |
anthropic/claude-3.5-sonnet | 0.756 | 0.689 | 5060 |
microsoft/wizardlm-2-8x22b | 0.752 | 0.362 | 7453 |
anthropic/claude-3-haiku:beta | 0.746 | 0.196 | 4760 |
openai/gpt-4 | 0.739 | 0.202 | 7507 |
openai/gpt-4o | 0.736 | 0.284 | 7467 |
openai/gpt-3.5-turbo-0125 | 0.730 | 0.421 | 7206 |
mistralai/mixtral-8x7b-instruct:nitro | 0.704 | 0.275 | 4712 |
meta-llama/llama-3-8b-instruct:nitro | 0.699 | 0.490 | 5119 |
google/palm-2-chat-bison-32k | 0.695 | 0.376 | 6139 |
mistralai/mistral-7b-instruct | 0.692 | 0.273 | 4961 |
openchat/openchat-7b | 0.689 | 0.158 | 4945 |
google/gemini-pro | 0.667 | 0.248 | 5090 |
perplexity/llama-3-sonar-small-32k-chat | 0.658 | 0.946 | 4744 |
google/gemma-7b-it | 0.656 | 0.148 | 5249 |
microsoft/phi-3-mini-128k-instruct | 0.629 | 0.216 | 6699 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
google/gemini-pro-1.5 | 0.966 | 0.941 | 5944 |
google/gemini-flash-1.5 | 0.951 | 0.909 | 4730 |
anthropic/claude-3.5-sonnet | 0.938 | 0.982 | 4915 |
openai/gpt-4o | 0.932 | 0.935 | 7240 |
anthropic/claude-3-haiku:beta | 0.923 | 0.905 | 4606 |
openai/gpt-4 | 0.922 | 0.962 | 7276 |
meta-llama/llama-3-70b-instruct:nitro | 0.919 | 0.928 | 4899 |
mistralai/mixtral-8x7b-instruct:nitro | 0.918 | 0.792 | 4568 |
openai/gpt-3.5-turbo-0125 | 0.906 | 0.923 | 6996 |
google/gemini-pro | 0.898 | 0.927 | 4938 |
microsoft/wizardlm-2-8x22b | 0.887 | 0.955 | 7219 |
perplexity/llama-3-sonar-small-32k-chat | 0.882 | 0.665 | 4605 |
meta-llama/llama-3-8b-instruct:nitro | 0.866 | 0.562 | 4959 |
mistralai/mistral-7b-instruct | 0.864 | 0.891 | 4806 |
google/palm-2-chat-bison-32k | 0.860 | 0.937 | 5967 |
openchat/openchat-7b | 0.760 | 0.987 | 4777 |
microsoft/phi-3-mini-128k-instruct | 0.720 | 0.980 | 6475 |
google/gemma-7b-it | 0.682 | 0.969 | 5082 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
google/palm-2-chat-bison-32k | 0.627 | 0.178 | 57 |
google/gemini-flash-1.5 | 0.591 | 0.406 | 495 |
openai/gpt-4o | 0.578 | 0.443 | 293 |
anthropic/claude-3-haiku:beta | 0.568 | 0.415 | 460 |
anthropic/claude-3.5-sonnet | 0.549 | 0.396 | 197 |
microsoft/wizardlm-2-8x22b | 0.538 | 0.427 | 204 |
google/gemini-pro-1.5 | 0.517 | 0.333 | 127 |
meta-llama/llama-3-70b-instruct:nitro | 0.516 | 0.239 | 96 |
openai/gpt-4 | 0.512 | 0.342 | 239 |
openai/gpt-3.5-turbo-0125 | 0.507 | 0.296 | 254 |
openchat/openchat-7b | 0.500 | 0.278 | 12 |
mistralai/mixtral-8x7b-instruct:nitro | 0.500 | 0.300 | 10 |
google/gemma-7b-it | 0.500 | 0.013 | 262 |
perplexity/llama-3-sonar-small-32k-chat | 0.485 | 0.171 | 76 |
meta-llama/llama-3-8b-instruct:nitro | 0.475 | 0.144 | 126 |
microsoft/phi-3-mini-128k-instruct | 0.473 | 0.458 | 19 |
google/gemini-pro | 0.460 | 0.585 | 152 |
mistralai/mistral-7b-instruct | | | 3 |

Model | Balanced Accuracy | Precision | Sample Size |
---|---|---|---|
openai/gpt-4o | 0.785 | 0.685 | 6597 |
anthropic/claude-3.5-sonnet | 0.784 | 0.603 | 4490 |
meta-llama/llama-3-70b-instruct:nitro | 0.783 | 0.619 | 4462 |
google/gemini-pro-1.5 | 0.780 | 0.583 | 5400 |
microsoft/wizardlm-2-8x22b | 0.773 | 0.554 | 6573 |
openai/gpt-3.5-turbo-0125 | 0.765 | 0.637 | 6343 |
anthropic/claude-3-haiku:beta | 0.756 | 0.631 | 4206 |
google/gemini-pro | 0.751 | 0.568 | 4467 |
openai/gpt-4 | 0.738 | 0.601 | 6626 |
google/gemini-flash-1.5 | 0.729 | 0.564 | 4306 |
google/palm-2-chat-bison-32k | 0.712 | 0.576 | 5440 |
mistralai/mixtral-8x7b-instruct:nitro | 0.702 | 0.563 | 4127 |
mistralai/mistral-7b-instruct | 0.653 | 0.509 | 4371 |
meta-llama/llama-3-8b-instruct:nitro | 0.641 | 0.434 | 4502 |
openchat/openchat-7b | 0.640 | 0.324 | 4368 |
microsoft/phi-3-mini-128k-instruct | 0.625 | 0.282 | 5886 |
perplexity/llama-3-sonar-small-32k-chat | 0.593 | 0.279 | 4177 |
google/gemma-7b-it | 0.523 | 0.433 | 4633 |
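
The tables do not say how the Balanced Accuracy and Precision columns were computed. As a rough guide to reading them, the sketch below uses the standard binary-classification definitions, which is an assumption rather than something the source confirms. The function names and the example counts are hypothetical and are not taken from the tables.

```python
def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of recall on the positive class and recall on the negative class.

    Assumes the standard definition; the source does not state its formula.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return (sensitivity + specificity) / 2


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts: a classifier that labels every one of 100 items
# positive when only 30 of them really are positive.
always_positive = {"tp": 30, "fp": 70, "tn": 0, "fn": 0}

ba = balanced_accuracy(**always_positive)
pr = precision(always_positive["tp"], always_positive["fp"])
print(f"balanced accuracy = {ba:.3f}, precision = {pr:.3f}")
```

Under these definitions, a classifier that always gives the same answer scores a balanced accuracy of 0.5 regardless of class balance, while its precision equals the share of positives in the data. That may be one way to read rows that report a balanced accuracy of exactly 0.500, though the tables alone cannot confirm it.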