Recent advances in Large Language Models (LLMs) have yielded significant performance gains across NLP tasks, including text summarization. These improvements also call for more nuanced evaluation. In this talk, I will first explore the current landscape of human evaluation in summarization and present a fine-grained protocol and evaluation benchmark for assessing the salience of summaries. This work highlights possible biases in human evaluation in the era of LLMs, underscoring the need for more targeted evaluation. I will then introduce our work that identifies issues in existing evaluation benchmarks for factual consistency in summarization and proposes a novel protocol for efficiently creating similar benchmarks targeting specific error types. Our resulting benchmark reveals gaps in the ability of LLMs to detect factual inconsistencies. To conclude, I will discuss additional challenges and future directions for summarization evaluation.