Computer Science Assistant Professor Furong Huang Answers Questions About Artificial Intelligence

The College of Computer, Mathematical, and Natural Sciences hosted a Reddit Ask-Me-Anything spotlighting research on ethical AI.

Computer Science Assistant Professor Furong Huang promoting her Reddit AMA on Tuesday, May 14. Photo by Katie Bemb.

University of Maryland Computer Science Assistant Professor Furong Huang participated in an Ask-Me-Anything (AMA) user-led discussion on Reddit to answer questions about artificial intelligence (AI).

Huang, who also holds a joint appointment in the University of Maryland Institute for Advanced Computer Studies, researches trustworthy machine learning (ML), AI for sequential decision-making and generative AI (including large language models and autonomous agents). She develops efficient and ethical ML algorithms that operate effectively in real-world settings. She has also made significant strides in sequential decision-making, aiming to develop algorithms that not only optimize performance but also adhere to ethical and safety standards. 

Souradip Chakraborty and Mucong Ding—two computer science Ph.D. students in Huang’s research group—joined Huang to answer questions on Reddit.

This Reddit AMA has been edited for length and clarity.


I'm a software developer. Even though I've worked with machine learning a bit, I’ve never seen "under the hood." Is it true that artificial intelligence is really "IF IF IF IF ..." inside?

(Huang) In my intro to machine learning class, I teach in lesson number one that AI is not rule-based programming. It's learning patterns from data and then trying to make an inference about something you've never seen before. This can be something as simple as an image classification pipeline, where your machine learning model is presented with labeled images of, say, cats and dogs, and your learning system learns common patterns from those labeled images. After training, during deployment, when you have a new image of a cat or a dog that was never in your training examples, the model can still infer whether it is a cat or a dog. The model is empowered with a learning ability to understand the world and make generalizations about the unseen future.

(Ding) It's a very interesting question. We could, in principle, compile current powerful models into this kind of IF IF IF rule-based system, but it would be very different from IF branches in programming languages. The compiled program from a machine learning model would not be interpretable; you'd just see trails of IF conditions.
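As a rough illustration of the learn-then-generalize idea Huang describes, here is a minimal sketch using scikit-learn. The two hand-made features and the toy labels are invented for illustration; a real image pipeline would learn from raw pixels, typically with a neural network.

```python
# Minimal sketch of "learn patterns from labeled data, then infer on unseen data."
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy "images" reduced to two hand-made features (e.g., ear pointiness, snout length).
X_train = np.array([[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]])
y_train = np.array(["cat", "cat", "dog", "dog"])  # labels supplied by humans

model = LogisticRegression().fit(X_train, y_train)  # learn patterns from labeled data

# Inference on an example that was never in the training set.
new_image_features = np.array([[0.85, 0.25]])
print(model.predict(new_image_features))  # -> ['cat']
```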

With AI-generated content on the rise, do you foresee a garbage-in, garbage-out issue? If so, do you guys have ideas on how the big players will attempt to combat that?

(Huang) That's a very good question. I think the traditional signal processing community often has this perception of garbage-in, garbage-out pipelines. But in machine learning, for example in diffusion models, you may have garbage-in, gold-out. Such models are enabled through very well-curated training data. But there are a lot of issues that arise, including ethical issues of AI/ML models, that are attributed to the bias in the data used to train the model. If we want to build more responsible AI models, we should be careful about the quality of the data the models are built with.

There are copyright issues because these high-tech companies built very powerful generative AI models from data that might be copyright-protected. I believe that companies such as OpenAI are proactive in addressing those issues. They have a program where they say, if you want to opt copyrighted material out of the training data, you can file a request. They verify you are the owner of the data, then make sure everything connected to that data is deleted from their training database. There are also questions about how, when models generate a lot of revenue, that revenue should be attributed to the training data points. People are actively doing research on that right now.

(Chakraborty) Some companies are combating garbage-in, garbage-out with alignment, which is a method that prevents the model from generating garbage. Even if the original model was trained on garbage, we can still keep it from generating garbage by aligning it with human preferences.

(Ding) High-quality training data is an important ingredient of these LLMs and diffusion models, so we may see large companies fight over these copyright issues, and that does have an impact on individual content creators. However, as large models become more capable, there is also a possibility that they can be better than existing technology.
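To make the alignment idea Chakraborty mentions concrete, here is a minimal best-of-n sketch in which a stand-in preference (reward) model decides which candidate response gets returned. Both functions are hypothetical placeholders, not the method or API of any particular company; real alignment pipelines typically fine-tune the model itself on human preference data.

```python
def base_model_samples(prompt, n=3):
    # Stand-in for sampling n candidate responses from a pretrained base model.
    # (A real system would condition on the prompt; these are canned strings.)
    return [
        "Here is some low-quality garbage text.",
        "Models trained on noisy data can still be steered toward useful answers.",
        "No comment.",
    ][:n]

def preference_score(response):
    # Stand-in for a reward model trained on human preference comparisons;
    # here it simply penalizes an obvious "garbage" marker and favors substance.
    if "garbage" in response:
        return -1.0
    return 0.01 * len(response)

def aligned_answer(prompt):
    # Best-of-n selection: return the candidate the preference model ranks highest.
    return max(base_model_samples(prompt), key=preference_score)

print(aligned_answer("Explain garbage-in, garbage-out."))
```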

It seems like generative AI will have a lot of negative effects. How does a person with a moral compass work on advancing these models? Are there any positives that these models will provide that outweigh these negatives?

(Huang) Deepfakes are an example of how bad social actors can use data from multiple modalities, such as text and images, to create misleading and harmful content. There is a huge research community working to combat deepfakes. We are very aware of that issue, and our group works hard to combat deepfakes from a multimodal perspective. We are working on the detectability of AI-generated text as well as AI-generated images using watermarks. Our group strives to create responsible, democratized AI that serves humans, and there is definitely a lot of research to be done in this area. We call on different sectors, such as government agencies, high-tech companies, and academia, to contribute more attention and resources toward addressing these issues. As an educator, I feel responsible for educating the general public on coexisting with AI and coping with the potentially drastic changes that it can bring to our lives.

(Chakraborty) AI researchers are also responsible for providing tools to help. For example, if the detectability mentioned above works well, then we can run the detector on a website and simply remove the flagged content. Ultimately, technology is neutral; it depends on the social actors who are using it.

I am a computer scientist from the area of formal methods. Recently, most companies are interested only in AI and large language models. Will the current AI buzz pass at some point, like the deep learning buzz did, or will large language models indeed revolutionize things?

(Huang) There is a seminal paper from a group of researchers at UC Berkeley, "Diversity Is All You Need," which is basically a philosophy that says diversity in data will help you in terms of generalization. It's a very interesting analogy: if you have a field where everybody is working on the same thing (such as LLMs), your resilience and generalizability will be compromised to some extent. Diversity is very important, even in this research community in AI and ML. This is why we need thinkers, researchers, funding agencies, and industry partners to be more open-minded and not necessarily work only on the hottest topic. This is important for the resilience and sustainability of the entire AI/ML community.

AI problems such as ‘hallucinations’ or being ‘confidently wrong’ arise from the AI prioritizing natural and fluid grammar over factual answers, sometimes even for basic arithmetic. Are there any efforts to develop machine learning techniques that actually address the meaning (semantic content) and not just the structure (syntactic content)? 

(Huang) Nowadays, LLMs do understand semantics. As for the problems of hallucination and being 'confidently wrong,' they do exist, especially with spurious correlations and adversarial examples. Our group has recently investigated how to reduce spurious correlations by providing more context to the models so that the model understands where to concentrate when making a decision. We've also been looking at how to improve the robustness of these systems against adversarial perturbations in an ever-changing dynamic system.

How do we get to AI being able to cite its sources?

(Huang) People are aware of this problem, and there is a lot of ongoing research in that direction. For example, retrieval augmented generation (RAG) basically cites a knowledge base when generating answers to questions.

There has also been research on data models and mechanistic interpretability that strives to cite sources. Simple things such as searching on a search engine could also be useful. 
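As a rough sketch of the retrieval step behind RAG, the snippet below ranks passages from a tiny, invented knowledge base and returns them together with their source IDs so an answer could cite them. Production systems typically use learned embeddings and pair retrieval with an LLM for generation, which is omitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny, invented knowledge base; in practice this would be a large document store.
knowledge_base = {
    "doc1": "Retrieval augmented generation grounds answers in retrieved passages.",
    "doc2": "Watermarks embed detectable signals in AI-generated text and images.",
}
doc_ids = list(knowledge_base)
doc_texts = list(knowledge_base.values())

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(doc_texts)

def retrieve(question, k=1):
    # Rank passages by similarity to the question and return them with their
    # source IDs, so a generated answer can cite where the information came from.
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [(doc_ids[i], doc_texts[i]) for i in top]

# doc1 should rank first for this question.
print(retrieve("Which technique grounds generated answers in retrieved passages?"))
```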

Given that AI is, in some sense, trained to produce plausible output, how can it be trustworthy? As a scientist I worry that AI tools are tuned specifically to get past our internal gates of plausibility and checks on reasonableness, making it very hard to distinguish real insight from spurious confabulation.

(Huang) Research on hallucination and vulnerability to spurious correlations and adversarial perturbations is important for ensuring safe AI. Some may say it is a cat-and-mouse game, in that if you make the model more robust, the attacker can also adapt to become stronger or more malicious. To that extent, there is some research on understanding the possibility and impossibility of robustness, but in general, I believe in defending against dynamic adversaries with adaptability.

Do you find that some of the properties you're aiming for can be contradictory? For example, do sustainability, ethics and responsibility get in the way of efficiency and robustness?

(Huang) Sometimes there is a tradeoff between accuracy and efficiency, or accuracy and fairness, or accuracy and robustness, and so on. I call on anyone working in the field to take a multidimensional view of how to evaluate your method or model. You shouldn't only care about accuracy; you should care about the Pareto frontiers of a set of metrics that matter.
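One way to make that multidimensional view concrete is to compute the Pareto frontier over several metrics, keeping every model that no other model beats on all of them at once. The sketch below does this for a few hypothetical candidates with invented scores.

```python
# Invented scores for illustration; in practice these would come from evaluation runs.
candidates = {
    "model_a": {"accuracy": 0.92, "robustness": 0.60, "fairness": 0.70},
    "model_b": {"accuracy": 0.88, "robustness": 0.75, "fairness": 0.80},
    "model_c": {"accuracy": 0.85, "robustness": 0.55, "fairness": 0.65},
}

def dominates(a, b):
    # a dominates b if a is at least as good on every metric and strictly better on one.
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)

# Keep every candidate that no other candidate dominates.
pareto = [
    name for name, scores in candidates.items()
    if not any(dominates(other, scores)
               for other_name, other in candidates.items() if other_name != name)
]
print(pareto)  # -> ['model_a', 'model_b']; model_c is dominated by model_b
```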

What current developments in AI are you most excited about?

(Ding) Some recent advancements show that long-standing training frameworks, for example backpropagation and transformers, can lead to models with close-to-human ability on some problems. So we think it's worthwhile to keep working in this direction, improving this technology and providing benefits to the general public.

(Chakraborty) I am excited mostly about autonomous agent interaction and how that can impact society. For example, using agent interaction we’ll be able to solve hard math puzzles or write difficult code.

(Huang) While there is hype about the promising future of AI, we should be very careful about its safety issues. If you deploy these capable models and autonomous agents that can carry out tasks, the harm can be quite significant. We have some work on adversarial attacks and data poisoning that revealed some of the vulnerabilities of these models. We need to make sure these agents are very safe before we can deploy them.

About the College of Computer, Mathematical, and Natural Sciences

The College of Computer, Mathematical, and Natural Sciences at the University of Maryland educates more than 8,000 future scientific leaders in its undergraduate and graduate programs each year. The college's 10 departments and nine interdisciplinary research centers foster scientific discovery with annual sponsored research funding exceeding $250 million.