15 April, 2024
I. Literature review
- Background
To become familiar with summarization and customizable summarization, read through these two survey papers:
  - Koh, H. Y., Ju, J., Liu, M., & Pan, S. (2022). An empirical survey on long document summarization: Datasets, models, and metrics. ACM Computing Surveys, 55(8), 1-35. https://dl.acm.org/doi/pdf/10.1145/3545176?casa_token=4CNDOikpOucAAAAA:77ZHfymX0246qPAYZ8u4chHGHO0FlDtjEopUxMD3l92ogiz1tHujHhJCyidDxtpruTq9C
  - Urlana, A., Mishra, P., Roy, T., & Mishra, R. (2023). Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects–A Survey. arXiv preprint arXiv:2311.09212. https://arxiv.org/pdf/2311.09212.pdf
- Related work
Find at least 10 high-quality papers related to the following sub-topics. For each paper, summarize its
customizable attributes (such as length and topic), methods, datasets, and results:
Customizable summarization for legal documents
Customizable summarization for other domains.
II. Research problems
- Assert the statements below as research problems.
- Rewrite or elaborate on these statements, using the papers you read for Related work as evidence.
Beyond generic summarization, users have different preferences for the summary. For example, users may want the
summary output to:
focus on certain topics
provide different levels of detail
Customizable summarization for legal documents has not been explored.
Customizable summarization is ideal for users in the legal domain.
No ML datasets exist for customizable summarization of legal documents.
III. Research purposes
- Propose a high-quality customizable summarization dataset specific to patent documents by leveraging LLMs.
a. Customizing attributes: specificity, section focus, abstractiveness
- Propose SOTA baselines on the novel dataset (optional)
IV. Methods
- Data description and preprocessing
The patent document dataset (before data annotation) is provided. Link to access.
In your report, you need to cover:
How was the data collected?
Provide statistics about the data such as # of documents, length, metadata, etc.
What else do you know about the data? What domain does it belong to?
- Data annotation with LLMs
You will experiment with different prompting techniques and iteratively refine the prompt, using a powerful LLM
(such as GPT-3.5), until the summaries are satisfactory with respect to the expected customization attributes.
In your report, you should draw a framework/workflow to show how the data annotation works.
Customization attributes:
Which aspects of a patent document summary do you think intellectual property (IP) users may want to
customize?
Here, we can first focus on 3 customization attributes: specificity, main focus, and abstractiveness, but you are
welcome to suggest other attributes.
Specificity refers to the level of detail and description that the summary should include; you should define rank
values (normal, high, or others).
Main focus may include the invention description, claims, and novelty of the invention.
Abstractiveness concerns the proportion of the summary that is extracted from the source text: the smaller that
proportion, the more abstractive the summary. Similarly, you should define rank values (normal, high, or others).
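As a minimal sketch, the three attributes could be encoded as a small settings object and rendered into a prompt. The rank values normal and high and the three focus options come from the description above; the names CustomizationSetting, all_settings, and build_prompt, as well as the instruction wording, are illustrative assumptions and only a starting point for refinement.

```python
from dataclasses import dataclass
from itertools import product

# Values taken from the description above; extend them if you define other ranks.
SPECIFICITY = ("normal", "high")
MAIN_FOCUS = ("invention description", "claims", "novelty")
ABSTRACTIVENESS = ("normal", "high")


@dataclass(frozen=True)
class CustomizationSetting:
    """One combination of the three customization attributes (hypothetical helper)."""
    specificity: str
    main_focus: str
    abstractiveness: str

    def __post_init__(self):
        # Reject values outside the defined ranks so typos surface early.
        if self.specificity not in SPECIFICITY:
            raise ValueError(f"unknown specificity: {self.specificity}")
        if self.main_focus not in MAIN_FOCUS:
            raise ValueError(f"unknown main focus: {self.main_focus}")
        if self.abstractiveness not in ABSTRACTIVENESS:
            raise ValueError(f"unknown abstractiveness: {self.abstractiveness}")


def all_settings():
    """Enumerate every combination of attribute values."""
    return [
        CustomizationSetting(s, f, a)
        for s, f, a in product(SPECIFICITY, MAIN_FOCUS, ABSTRACTIVENESS)
    ]


def build_prompt(setting, patent_text):
    """Render a setting into a prompt string; the wording is an assumption."""
    return (
        "Summarize the patent document below.\n"
        f"Specificity: {setting.specificity}.\n"
        f"Main focus: {setting.main_focus}.\n"
        f"Abstractiveness: {setting.abstractiveness}.\n\n"
        f"Patent document:\n{patent_text}"
    )
```

Enumerating the settings up front makes it easier to check that every attribute and rank value has a corresponding prompt before any summaries are generated.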
Developing prompt engineering strategies to generate high-quality customized summaries:
Step 1: Sample about 2-3 patent documents for this task.
Step 2: Develop a prompt for each customization attribute and rank value. Then, give each prompt together with a
patent document as input to the ChatGPT interface to generate summaries.
Step 3: Manually evaluate the quality of each generated summary on the four dimensions below (on a scale of 1 to 5)
and record the evaluation results.
Clarity: Is the summary reader-friendly? Does it express ideas clearly?
Accuracy: Does the summary contain the same information as the original document?
Coverage: How well does the summary cover the important information in the original document?
Customizability: Does the summary meet your customization criteria (such as high specificity)?
For example, the question designed for clarity:
Step 4: Based on the evaluation results from step 3, refine the prompt and repeat steps 2 and 3. Remember to record
the prompt, summary, and evaluation result for each refinement round. Repeat step 4 until the summaries are
satisfactory.
Step 5: After refining the prompts and obtaining the most satisfactory summaries in step 4, report the 3-5 best
prompts, their customization settings, and the corresponding summaries as examples. The table below is for one
optimal prompt and its generated summary.
Generating a customizable summarization dataset for patent documents:
Select the best prompt, then use the GPT-3.5 API to generate customized summaries for each of the 1000
patent documents. Save the patent document, patent_id, and generated summary to JSON or CSV.
- Evaluate the quality of the generated summarization dataset
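The evaluation works on the records saved during generation. A minimal sketch of how those records might be produced and stored, one JSON object per line, is shown here. The LLM call is passed in as a plain function taking a prompt and returning a summary, so no particular client library or API signature is assumed; the name generate_dataset, the "text" key, the {document} placeholder, and the file layout are all hypothetical.

```python
import json


def generate_dataset(patents, prompt_template, generate, out_path):
    """Write one JSON record per patent and return the number of records.

    patents: iterable of dicts with "patent_id" and "text" keys (assumed layout).
    prompt_template: prompt text containing the literal placeholder {document}.
    generate: callable mapping a prompt string to a summary string; wrap your
        own LLM API call in it.
    out_path: destination file in JSON Lines format.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for patent in patents:
            # str.replace avoids str.format errors when the template or the
            # patent text contains other curly braces.
            prompt = prompt_template.replace("{document}", patent["text"])
            record = {
                "patent_id": patent["patent_id"],
                "document": patent["text"],
                "summary": generate(prompt),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the LLM call behind a function argument also lets you test the pipeline on a few documents with a stub before spending API calls on the full set.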
Automatically evaluate the generated summaries against the source patent documents using common metrics such
as ROUGE, BERTScore, and SummaC.
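To make the overlap idea behind ROUGE concrete, a simplified ROUGE-1 (unigram precision, recall, and F1) can be computed by hand. This sketch assumes lowercase whitespace tokenization with no stemming, so its numbers will differ from those of standard ROUGE packages; rouge1 is a hypothetical helper, and the source patent document plays the role of the reference, as described above.

```python
from collections import Counter


def rouge1(summary, reference):
    """Simplified ROUGE-1 between a summary and a reference text."""
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((summary_counts & reference_counts).values())

    # max(..., 1) guards against division by zero for empty inputs.
    precision = overlap / max(sum(summary_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because a summary is much shorter than the full patent, recall against the source will usually be low; precision is likely the more informative number in this setup.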
On a sample of around 30-50 examples, manually evaluate the quality of generated summaries on the 4 dimensions
mentioned above.
Clarity: Is the summary reader-friendly? Does it express ideas clearly?
Accuracy: Does the summary contain the same information as the original document?
Coverage: How well does the summary cover the important information in the original document?
Customizability: Does the summary meet your customization criteria (such as high specificity)?
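The manual ratings could be aggregated per dimension as sketched here. The four dimension names come from the list above; the record layout (one dict of 1-5 scores per evaluated summary) and the function name mean_scores are assumptions.

```python
from statistics import mean

DIMENSIONS = ("clarity", "accuracy", "coverage", "customizability")


def mean_scores(ratings):
    """Average the 1-5 ratings per dimension over all evaluated summaries.

    ratings: list of dicts, each mapping every dimension name to a score.
    """
    if not ratings:
        raise ValueError("no ratings to aggregate")
    for index, rating in enumerate(ratings):
        for dimension in DIMENSIONS:
            score = rating[dimension]
            # Scores outside the 1-5 scale usually indicate a data-entry error.
            if not 1 <= score <= 5:
                raise ValueError(f"rating {index}: {dimension}={score} is outside 1-5")
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}
```

Reporting the per-dimension means alongside the number of rated examples makes the results easier to compare across prompts and customization settings.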
I can share my evaluation project on Appen so that you can adopt it for this study.
- Instruction-tuning an open-source LLM for customizable summarization (Optional)
This part is optional, as the preceding work is sufficient for a term project.
Using the dataset that we generated, you will fine-tune an open-source LLM for customizable summarization using
instruction tuning. If you want to do this part, I will provide more details later on.
V. Results
To present the results, you need to provide statistics about and a description of the generated dataset. You also
need to present the evaluation results.